Abstract

We find large deviations rates for consensus-based distributed inference for directed networks. When 
the topology is deterministic, we establish the large deviations principle and find exactly the corresponding 
rate function, equal at all nodes. We show that the dependence of the rate function on the stochastic 
weight matrix associated with the network is fully captured by its left eigenvector corresponding to the 
unit eigenvalue. Further, when the sensors’ observations are Gaussian, the rate function admits a closed 
form expression. Motivated by these observations, we formulate the optimal network design problem 
of finding the left eigenvector which achieves the highest value of the rate function, for a given target 
accuracy. This eigenvector therefore minimizes the time that the inference algorithm needs to reach 
the desired accuracy. For Gaussian observations, we show that the network design problem can be 
formulated as a semidefinite (convex) program, and hence can be solved efficiently. When observations 
are identically distributed across agents, the system exhibits an interesting property: the graph of the rate 
function always lies between the graphs of the rate function of an isolated node and the rate function 
of a fusion center that has access to all observations. We prove that this fundamental property holds 
even when the topology and the associated system matrices change randomly over time, with arbitrary 
distribution. Due to generality of its assumptions, the latter result requires more subtle techniques than 
the standard large deviations tools, contributing to the general theory of large deviations. 
I. Introduction

The field of wireless sensor networks (WSN) has significantly evolved since its beginnings about 
two decades ago. Starting from wildlife monitoring, smart housing, and building and infrastructure 
surveillance 0, the applications of WSNs have grown both in diversity and in scale. They now include 
monitoring and control of some highly complex large scale systems, such as vehicular networks and 
electric power grids. Another important emerging trend in this field is networks consisting of
thousands of very small and simple sensing devices, such as microrobots 0 and nano-networks 0. 

Due to the increased complexity and scale of WSNs, there has been significant interest recently in 
algorithms that process network information using local communications only 0, 0, 0. A representative
of this class of algorithms is the consensus algorithm 0, 0, [9]. With consensus algorithms, each
agent maintains over iterations an estimate of the quantity of interest and over time it communicates 
the estimate to its immediate neighbors. In addition, intertwined with local communications are local 
agents’ innovations, where agents collect new measurements and incorporate them in an iterative fashion 
in their current estimates. Algorithms of this form, referred to as consensus+innovations [10], possess
several desirable features, including scalability and simplicity of implementation. Further, they are robust 
to structural changes in the system, such as node failures and intermittent communications, which are 
typical for complex systems consisting of many structurally simple devices. In terms of applications, consensus
algorithms have been applied in various contexts: distributed Kalman filtering [11], [12],
distributed detection 0, [13], [8], [14] and parameter estimation 0, [15], [10], distributed learning [16],
and tracking [17].

In this paper, we study large deviations performance of consensus algorithms when the underlying 
network is directed. This complements the existing work that usually studies asymptotic variance or 
asymptotic normality [10], [18]. Our goal is to compute (or characterize, when exact computation is not
possible) the rates at which the local nodes’ estimates converge to the desirable values (e.g., the vector 
of true parameters that are being estimated). To explain the relevance of large deviations performance, 
consider, for example, a binary hypothesis testing problem in a WSN. In this context, the rates of large 
deviations correspond to error exponents, i.e., they provide answers to how fast the error probabilities
(false alarm, missed detection, or total error probability) decay with time. In the context of estimation, large
deviation rates provide estimates of times to reach a desired accuracy region around the true parameter 
that the local estimates converge to. Naturally, the higher the rate of a node, the better is the decision or 
estimation produced by that node at a given time. One particular goal of this paper is to provide answers 
to questions such as: “How much faster does a node in a network filter out the estimation noise compared
to a node that operates alone?”

Contributions. We consider both cases when the local nodes’ interactions are deterministic and when 
they are random, where the local interactions are captured by associated stochastic system matrices.¹ For
the deterministic case, we prove the large deviation principle at each node, and we find the corresponding 
rate function, equal at all nodes. We prove that its dependence on the (stochastic) system matrix A is 
fully captured by the left eigenvector a of A associated with the eigenvalue one, i.e., the left Perron 
vector of A. When the observations are Gaussian (independent, but non-identically distributed), we find
a closed form expression for the rate function. Motivated by the fact that the rate function strongly 
depends on the eigenvector a, we formulate the following network design problem. For a given accuracy 
region, find the optimal vector a that maximizes the value of the rate function on this fixed region. We 
further show that for Gaussian observations with equal means (but different covariance matrices), this 
problem can be formulated as a semidefinite program (SDP) and thus can be solved efficiently. Simulation 
examples demonstrate that the optimized system significantly outperforms the system with the uniform 
left eigenvector a that, in a sense, equally “weighs” all of the nodes’ estimates. Finally, considering the 
special case when the observations are independent and identically distributed (i.i.d.), we reveal a very 
interesting property: the rate function, independently of the choice of A, always lies between the rate 
function of an isolated node and the rate function of a fusion center. Intuitively, this means that the 
distributed system is always better than an isolated node, and that, on the other hand, it can never beat the
performance of a fusion center. Moreover, we prove that this fundamental property holds with random 
system matrices of arbitrary distribution (including, e.g., temporal dependencies), as long as they are 
independent from the observations. Due to the generality of the assumptions, the proof of this result 
requires much more sophisticated techniques than the deterministic case, which improve over the state 
of the art large deviations techniques and hence constitute a contribution in their own right.

Related work. Large deviations asymptotic performance of consensus+innovations algorithms has been 
previously studied in [9], [13], [19], [20], and [21]. Reference [19] studies large deviations of the stochastic
Riccati equation for the distributed Kalman filter, and it provides an upper and a lower bound for the
large deviations rate function. Reference [20] considers consensus-based distributed detection with a
constant learning step and shows that the local decision statistics satisfy the large deviations principle,
characterizing the corresponding rate function. Reference [21] studies belief formations in social networks

¹ A stochastic matrix has nonnegative entries and rows that sum to one.


April 29, 2015 


DRAFT 




and characterizes error exponents (Kullback-Leibler divergences) for the distributed multiple hypothesis 
testing problem. In our previous work [9], [13], we considered the case of i.i.d. networks, where each
topology realization is symmetric. Under this model, reference [9] finds an upper and a lower bound for
the rate function when the observations are Gaussian, and reference [13] extends the results of [9] to
arbitrary distributions of sensor observations. In this work, we go beyond these results in several important 
directions. First, we study here directed random networks, and, furthermore, we make no restrictions on 
the distribution of the system matrices; in particular, we allow for their arbitrary time correlations. Second, 
when the system matrices are deterministic, asymmetric, we fully characterize the rate function and show 
that it is amenable to optimization. 

Notation. For arbitrary $d \in \mathbb{N} = \{1, 2, \ldots\}$, we denote by $0_d$ the $d$-dimensional vector of all zeros; by $1_d$ the $d$-dimensional vector of all ones; by $e_i$ the $i$-th canonical vector of $\mathbb{R}^d$ (that has value one on the $i$-th entry and the remaining entries are zero); by $I_d$ the $d$-dimensional identity matrix; by $J_d$ the $d \times d$ matrix whose entries all equal $1/d$. For a matrix $A$, we let $[A]_{ij}$ and $A_{ij}$ denote its $i,j$ entry, and for a vector $a \in \mathbb{R}^d$, we denote its $i$-th entry by $a_i$, $i, j = 1, \ldots, d$. For a function $f : \mathbb{R}^d \to \mathbb{R}$, we denote its domain by $\mathcal{D}_f = \{x \in \mathbb{R}^d : -\infty < f(x) < +\infty\}$; the subdifferential (gradient, when $f$ is differentiable) of $f$ at a point $x$ by $\partial f(x)$ ($\nabla f(x)$); $\log$ denotes the natural logarithm; for two sequences $f_t$ and $g_t$ that are asymptotically equal at the logarithmic scale, $\lim_{t \to +\infty} \log f_t / \log g_t = 1$, we shortly write $f_t \sim g_t$. For $N \in \mathbb{N}$, we denote by $\Delta^{N-1}$ the probability simplex in $\mathbb{R}^N$ and by $\alpha$ the generic element of this set: $\Delta^{N-1} = \{\alpha \in \mathbb{R}^N : \alpha_i \ge 0, \ \sum_{i=1}^{N} \alpha_i = 1\}$. We let $\lambda_{\max}$ and $\lambda_2$, respectively, denote the maximal and the second largest (in modulus) eigenvalue of a square matrix; $\dagger$ denotes the pseudoinverse of a square matrix; and $\|\cdot\|$ denotes the spectral norm. For a matrix $S \in \mathbb{R}^{N \times N}$, we let $\mathcal{R}(S)$ denote the range of $S$, $\mathcal{R}(S) = \{Sx : x \in \mathbb{R}^N\}$, and for $N$ square matrices $S_1, \ldots, S_N$, we let $\mathrm{diag}\{S_1, \ldots, S_N\}$ denote the block-diagonal matrix whose $i$-th block is $S_i$, for $i = 1, \ldots, N$. An open Euclidean ball in $\mathbb{R}^d$ of radius $\rho$ and centered at $x$ is denoted by $B_x(\rho)$; the closure, the interior, the boundary, and the complement of an arbitrary set $D \subseteq \mathbb{R}^d$ are respectively denoted by $\overline{D}$, $D^\circ$, $\partial D$, and $D^c$; $\mathcal{B}(\mathbb{R}^d)$ denotes the Borel sigma algebra on $\mathbb{R}^d$; $\Omega$ denotes the probability space and $\omega$ denotes an element of $\Omega$; $\mathbb{P}$ and $\mathbb{E}$ denote the probability and the expectation operator; $\mathcal{N}(m, S)$ denotes the Gaussian distribution with mean vector $m$ and covariance matrix $S$.

Paper organization. In Section II we present the system model and formulate the problem that we study. In Section III we give preliminaries. Section IV presents our results for the deterministic case. Using the results of Section IV, Section V formulates the network design problem and solves it for the case of Gaussian observations with equal means. Section VI presents the fundamental bounds on the rate function for the generic case, when system matrices are random; proofs of this result are given in Subsections VI-A and VI-B. Simulation results are presented in Section VII, and the conclusion is given in Section VIII.


II. Problem setup 

This section explains the system model and the distributed inference algorithm that we study. 
Network observations. Suppose that we have N geographically distributed agents (e.g., sensors, robots, 
humans) that monitor and collect observations about their environment. We denote the set of agents by 
$V = \{1, 2, \ldots, N\}$ such that $i \in V$ denotes the $i$-th agent. At each new time instant $t = 1, 2, \ldots$, each
agent produces a $d$-dimensional observation vector. We denote by $Z_{i,t} \in \mathbb{R}^d$ the observation vector of
agent $i$ at time $t$, where we assume that the measurements are made synchronously across all agents. We
denote by $m_i$ the expected value of observations at node $i$, $m_i = \mathbb{E}[Z_{i,t}]$ (constant for all $t$).
Inter-agent communication. We assume that a direct communication is possible only between a subset of 
agents’ pairs, e.g., the agents that are close enough to each other. (For instance, in a WSN, communication 
links are established only between sensors that lie within a certain, predefined distance r from each 
other.) We model the possible inter-agent communications via a directed graph $G = (V, E)$, where the set
$E \subseteq V \times V$ collects all possible (directed) communication links, i.e., all pairs $(j, i)$ such that agent $i$
can receive messages from agent $j$ in a single hop manner. The links in $E$ should be understood only
as potential communication channels. In other words, at a certain time $t$, agent $j$ may decide whether to
send or not send a message to agent $i$. Also, in the case a message from $j$ to $i$ was sent, its reception
at $i$ could be unsuccessful due to imperfect channel effects (e.g., fading). For any link $(j, i) \in E$, we
say that $(j, i)$ is active at time $t$ if at time $t$ a message is sent from $j$ and successfully received at $i$.
We let $E_t$ denote the set of all active links at time $t$. Accordingly, the neighborhood of node $i$ at time $t$
is $O_{i,t} = \{j : (j, i) \in E_t\}$, that is, $O_{i,t}$ collects the sources of all links that are active at time $t$ and point to $i$; for
any $j \in O_{i,t}$, we say that $j$ is an active neighbor of $i$. Finally, we denote by $G_t = (V, E_t)$ the graph
realization at time $t$.

Consensus+innovations based distributed inference. The distributed inference algorithm that we study 
operates as follows. Each node, over time, maintains a d-dimensional vector that serves as the node’s 
estimate on the state of nature. The estimate of node $i$ at time $t$ is denoted by $X_{i,t}$, and we also refer
to it as the state of node $i$. The estimates (states) are improved over time in two steps. First,
each agent $i$ incorporates its new observation $Z_{i,t}$ into its current state with the weight $1/t$ and forms an
intermediate state update; subsequently, it transmits the intermediate state to (a subset of) its neighbors.
Second, agent $i$ forms a convex combination (weighted average) of its own and its active neighbors'
intermediate states, with the coefficients $\{W_{ij,t} : j \in O_{i,t}\}$, $i \in V$. Mathematically, the state update of
agent $i$ is:



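$$X_{i,t} = \sum_{j \in O_{i,t} \cup \{i\}} W_{ij,t} \left( \frac{t-1}{t} X_{j,t-1} + \frac{1}{t} Z_{j,t} \right), \qquad W_{ii,t} := 1 - \sum_{j \in O_{i,t}} W_{ij,t}, \qquad (1)$$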


with the initialization $X_{i,0} = 0_d$. To derive a more compact representation, collect for each $t$ the agents'
weights $W_{ij,t}$ in an $N \times N$ matrix $W_t$ as follows: for any pair $(j, i) \in E$ that satisfies $j \in O_{i,t}$, $[W_t]_{ij}$
is assigned the value $W_{ij,t}$ and equals zero otherwise, and for any $i \in V$, $[W_t]_{ii} = 1 - \sum_{j \in O_{i,t}} [W_t]_{ij}$.
We refer to matrix $W_t$ as the weight matrix. Due to the fact that $\{W_{ij,t} : j \in O_{i,t}\}$ form a convex
combination, $W_t$ is stochastic for any $t$. Further, let $\Phi(t, s)$, for $t \ge 1$ and $t \ge s \ge 1$, be defined as
$\Phi(t, s) = W_t \cdots W_s$, for $1 \le s \le t$. From (1) we obtain:



$$X_{i,t} = \frac{1}{t} \sum_{s=1}^{t} \sum_{j=1}^{N} [\Phi(t, s)]_{ij} Z_{j,s}. \qquad (2)$$


Algorithms of form (1) and (2) have been previously studied, e.g., in 0, [8], and [9].
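The equivalence between the recursive update and its unrolled form can be checked numerically. The following is a minimal sketch, assuming scalar observations ($d = 1$), three agents, and the recursion written in matrix form as $X_t = W_t\left(\frac{t-1}{t} X_{t-1} + \frac{1}{t} Z_t\right)$; the weight matrices, the observation values, and the function names `recursion` and `closed_form` are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def recursion(weights, obs):
    """Iterate X_t = W_t((t-1)/t X_{t-1} + (1/t) Z_t), starting from X_0 = 0."""
    n = len(obs[0])
    x = [F(0)] * n
    for t, (w, z) in enumerate(zip(weights, obs), start=1):
        # Innovation step: fold the new observation in with weight 1/t.
        mid = [F(t - 1, t) * x[j] + F(1, t) * z[j] for j in range(n)]
        # Consensus step: row i of W_t averages the intermediate states.
        x = [sum(w[i][j] * mid[j] for j in range(n)) for i in range(n)]
    return x

def closed_form(weights, obs):
    """Evaluate X_{i,t} = (1/t) sum_s sum_j [Phi(t,s)]_{ij} Z_{j,s}."""
    t_max, n = len(obs), len(obs[0])
    phi = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    total = [F(0)] * n
    for s in range(t_max, 0, -1):
        phi = matmul(phi, weights[s - 1])  # Phi(t, s) = Phi(t, s+1) W_s
        for i in range(n):
            total[i] += sum(phi[i][j] * obs[s - 1][j] for j in range(n))
    return [v / t_max for v in total]

# Hypothetical asymmetric (directed) row-stochastic weight matrices.
W_a = [[F(1, 2), F(1, 2), F(0)], [F(0), F(1), F(0)], [F(1, 3), F(1, 3), F(1, 3)]]
W_b = [[F(1), F(0), F(0)], [F(1, 4), F(3, 4), F(0)], [F(0), F(1, 2), F(1, 2)]]
weights = [W_a, W_b, W_a]
obs = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # obs[t-1][j] plays the role of Z_{j,t}

print(recursion(weights, obs) == closed_form(weights, obs))
```

Exact rational arithmetic is used so that the two forms can be compared with equality rather than a tolerance.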

We now state our assumptions on the weight matrices and the agents’ observations. 

Assumption 1 (Network and observation model). 

1) Observations $Z_{i,t}$, $i = 1, \ldots, N$, $t = 1, 2, \ldots$ are independent both across nodes and over time;

2) For each agent $i$, $Z_{i,t}$, $t = 1, 2, \ldots$ are identically distributed;

3) Quantities $W_t$ and $Z_{i,s}$ are independent for all $i, s, t$.

The model above is very general. In particular, in terms of the agents’ interactions, it allows for 
directed topologies and asymmetric weight matrices, and it also allows for time dependencies between 
the weight matrices; directed topologies and temporal dependencies are cases that are much less studied 
in the literature. In terms of observations, we remark that the model above allows for non-identically 
distributed observations. 

We next introduce the rates of large deviations and motivate their use for performance characterization 
of algorithm (1).

Rates of large deviations at individual agents. Suppose that, for some $i$, $X_{i,t}$ converges almost surely
(a.s.) to a deterministic vector $\theta \in \mathbb{R}^d$, e.g., the vector of $d$ parameters that the system wishes to estimate.
In many scenarios, it is of interest to determine at what rate this convergence occurs. To explain why
this is important, suppose that we wish to determine $\theta$ up to a certain accuracy defined by the accuracy
region $C \subseteq \mathbb{R}^d$, where $\theta \in C$. Let $T_i$ denote the time after which $X_{i,t}$ belongs to $C$ with a
prescribed, high probability, say 0.97. For convenience, define also the complement of $C$, $D = \mathbb{R}^d \setminus C$,
usually called the deviation set. Since $X_{i,t}$ converges a.s. to $\theta$, we know that the probability that $X_{i,t}$
remains outside of $C$, $\mathbb{P}(X_{i,t} \in D)$, vanishes as $t \to +\infty$. The question that we ask then is how fast
this probability vanishes with time. It turns out that in many scenarios this convergence is exponential
(see na for the scalar, d = 1 case). That is: 


$$\mathbb{P}(X_{i,t} \in D) \sim e^{-t I_i(D)}, \qquad (3)$$


for a certain function $I_i$, where, we recall, $\sim$ means that the two functions are asymptotically equal at
the logarithmic scale. Function $I_i : \mathcal{B}(\mathbb{R}^d) \to \mathbb{R}_+$ is usually called the rate function. Relating $I_i$ with
time $T_i$, we see that $T_i$ can be approximately computed as

$$T_i \approx -\frac{\log(1 - 0.97)}{I_i(D)}. \qquad (4)$$


The quality of the approximation in (4) improves for higher accuracies (i.e., smaller region $C$ around $\theta$).
In the context of, e.g., Neyman-Pearson hypothesis testing, rates $I_i$ directly correspond to error exponents:
taking, for example, $D$ to be the false alarm region $[0, +\infty)$ under $H_0$, $I_i(D)$ gives the error exponent
of the false alarm probability at sensor $i$. The problem that we address in this paper is finding the rate
functions $I_i$, $i \in V$:

$$\lim_{t \to +\infty} -\frac{1}{t} \log \mathbb{P}(X_{i,t} \in D) = I_i(D), \qquad (5)$$


whenever the limit above exists for any set $D \in \mathcal{B}(\mathbb{R}^d)$. For further details on the use of large deviations
rate functions in probabilistic inference, we refer the reader to [22], [23], [24].
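To make the link between a rate and the time to reach a target accuracy concrete, the approximation $T_i \approx -\log(1 - 0.97)/I_i(D)$ can be evaluated directly. This is a minimal sketch; the function name `time_to_accuracy` and the numerical rate values are hypothetical and chosen only for illustration.

```python
import math

def time_to_accuracy(rate, confidence=0.97):
    """Approximate time after which exp(-t * rate) drops to 1 - confidence.

    `rate` stands for the value I_i(D) of the rate function on the deviation set.
    """
    if rate <= 0:
        raise ValueError("the rate must be strictly positive")
    return -math.log(1.0 - confidence) / rate

# Hypothetical rates: doubling the rate halves the time to the same accuracy.
t_slow = time_to_accuracy(0.5)
t_fast = time_to_accuracy(1.0)
print(t_slow, t_fast)
```

The sketch only restates the approximation; it does not account for sub-exponential factors, which is why the approximation is expected to be better for higher accuracies.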


III. Preliminaries 


Before we start our analysis, we first review in Subsection III-A basic large deviations concepts and
tools. Subsection III-B then provides our intermediate results on the large deviations principle and the
corresponding rate functions of an isolated agent and a fusion node.


A. Large deviations preliminaries 

We define the large deviations principle and introduce, for each i, the logarithmic moment generating 
function of observations $Z_{i,t}$. We then define the conjugate of a function and state some important
properties of log-moment generating functions and their conjugates in general, and in our particular 
setup as well. 
Large deviations principle. A rate function is any function that is lower semi-continuous, or equivalently,
that has closed sublevel sets. A sequence of random variables $Z_t \in \mathbb{R}^d$ is said to satisfy the large deviations
principle (LDP) with rate function $I$ if for any measurable set $D \in \mathcal{B}(\mathbb{R}^d)$ it holds that

$$-\inf_{x \in D^\circ} I(x) \le \liminf_{t \to +\infty} \frac{1}{t} \log \mathbb{P}(Z_t \in D) \le \limsup_{t \to +\infty} \frac{1}{t} \log \mathbb{P}(Z_t \in D) \le -\inf_{x \in \overline{D}} I(x). \qquad (6)$$

Essentially, what the large deviations principle tells us is that, for any (nice enough) set $D$, the probabilities
that $Z_t$ belongs to $D$ decay with $t$ exponentially, with the rate equal to $I(D) = \inf_{x \in D} I(x)$. Two
key objects in proving the large deviations principle and computing the rate function in general (see
Cramér's and Gärtner-Ellis theorems [25], [26]) are the log-moment generating function and its conjugate,
which we introduce next.

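Log-moment generating function of observations. The log-moment generating function $\Lambda_i : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ corresponding to $Z_{i,t}$ is given by:

$$\Lambda_i(\lambda) = \log \mathbb{E}\left[ e^{\lambda^\top Z_{i,t}} \right], \quad \text{for } \lambda \in \mathbb{R}^d. \qquad (7)$$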


For the special case when all the agents' observations are identically distributed, we let $\Lambda$ denote the
corresponding log-moment generating function, $\Lambda = \Lambda_i$, for any $i$.

The second key object of interest in our analysis is the conjugate of a log-moment generating function. 
Let $\Lambda$ be the log-moment generating function of a $d$-dimensional random vector $Z$. Then, the conjugate,
or the Fenchel-Legendre transform, of $\Lambda$ is given by

$$I(x) = \sup_{\lambda \in \mathbb{R}^d} x^\top \lambda - \Lambda(\lambda), \quad \text{for } x \in \mathbb{R}^d. \qquad (8)$$

When $Z_{i,t}$ are i.i.d., we will denote by $I$ the conjugate of $\Lambda$. To illustrate how to compute $\Lambda$ and
$I$, we consider the case when $Z_{i,t}$ is a discrete random vector, i.e., when the agents' measurements are
quantized.


Example 2 (Quantized observations). Suppose that the agents' observations $Z_{i,t}$ are i.i.d., discrete random
vectors, taking values in the set $\mathcal{A} = \{a_1, \ldots, a_L\}$, according to the probability mass function $p =
(p_1, \ldots, p_L)$, where $a_l \in \mathbb{R}^d$ for $l = 1, \ldots, L$. For any $\lambda \in \mathbb{R}^d$, the value $\Lambda(\lambda)$ is then computed by

$$\Lambda(\lambda) = \log\left( \sum_{l=1}^{L} p_l e^{\lambda^\top a_l} \right). \qquad (9)$$

It can be seen that the function $\Lambda$ in (9) is finite on the whole space, i.e., $\mathcal{D}_\Lambda = \mathbb{R}^d$. Also, for the special
case when $a_l = e_l$, the conjugate of $\Lambda$ can be shown to be the relative entropy with respect to $p$, given
by m p. 41]:

$$I(x) = \sum_{l=1}^{d} x_l \log\frac{x_l}{p_l}, \quad \text{for any } x \in \Delta^{d-1}, \text{ and equals } +\infty \text{ otherwise.} \qquad (10)$$
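The relative entropy formula can be sanity-checked numerically for observations equal to canonical vectors. The sketch below assumes a three-letter alphabet with hypothetical probabilities `p` and a hypothetical point `x` in the simplex; it uses the fact that the choice $\lambda_l = \log(x_l/p_l)$ makes the objective $x^\top\lambda - \Lambda(\lambda)$ equal to the relative entropy, and then checks that a few perturbed values of $\lambda$ do not exceed it. The function names are illustrative.

```python
import math

def log_mgf(lam, p):
    """Lambda(lam) = log sum_l p_l exp(lam_l), for observations a_l = e_l."""
    return math.log(sum(pl * math.exp(ll) for pl, ll in zip(p, lam)))

def objective(lam, x, p):
    """The function x^T lam - Lambda(lam) whose supremum over lam is I(x)."""
    return sum(xl * ll for xl, ll in zip(x, lam)) - log_mgf(lam, p)

def relative_entropy(x, p):
    """sum_l x_l log(x_l / p_l), with the convention 0 log 0 = 0."""
    return sum(xl * math.log(xl / pl) for xl, pl in zip(x, p) if xl > 0)

p = [0.5, 0.3, 0.2]   # hypothetical probability mass function
x = [0.2, 0.3, 0.5]   # hypothetical point in the probability simplex

lam_star = [math.log(xl / pl) for xl, pl in zip(x, p)]
best = objective(lam_star, x, p)
kl = relative_entropy(x, p)

# Deterministic perturbations of lam_star; none should beat the maximizer.
shifts = [(0.3, 0.0, 0.0), (0.0, -0.7, 0.2), (1.0, 1.0, -2.0), (-0.5, 0.4, 0.1)]
perturbed = [objective([l + s for l, s in zip(lam_star, sh)], x, p) for sh in shifts]
print(best, kl, max(perturbed))
```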


Example 3 (Gaussian observations). It can be shown by simple algebraic manipulations that when $Z_{i,t}$
is i.i.d., Gaussian, with mean value $m$ and covariance matrix $S$, the log-moment generating function $\Lambda$
and its conjugate $I$ are both quadratic and given, respectively, by [26]:

$$\Lambda(\lambda) = m^\top \lambda + \frac{1}{2} \lambda^\top S \lambda, \qquad I(x) = \frac{1}{2} (x - m)^\top S^{-1} (x - m).$$
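The quadratic conjugate can be recovered numerically from the definition of the Fenchel-Legendre transform. The following is a minimal sketch for the scalar case ($d = 1$), assuming hypothetical values for the mean, the variance, and the evaluation point; the helper `conjugate` uses a plain ternary search, which is one possible way to maximize a concave function and is not a method taken from the paper.

```python
def conjugate(log_mgf, x, lo=-50.0, hi=50.0, iters=200):
    """Numerically evaluate sup_lam (x * lam - log_mgf(lam)) for scalar lam.

    The objective is concave because log_mgf is convex, so a ternary search
    on a bracket that contains the maximizer is sufficient.
    """
    def f(lam):
        return x * lam - log_mgf(lam)
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if f(a) < f(b):
            lo = a   # the maximizer lies to the right of a
        else:
            hi = b   # the maximizer lies to the left of b
    return f((lo + hi) / 2.0)

m, s2 = 1.0, 2.0   # hypothetical mean and variance

def gaussian_log_mgf(lam):
    return m * lam + 0.5 * s2 * lam * lam

x = 3.0
numeric = conjugate(gaussian_log_mgf, x)
closed = (x - m) ** 2 / (2.0 * s2)
print(numeric, closed)
```

The search bracket `[-50, 50]` is an assumption; it must contain the maximizer $(x - m)/\sigma^2$ for the result to be meaningful.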

To simplify our analysis, we make the following assumption. 

Assumption 4. $\mathcal{D}_{\Lambda_i} = \mathbb{R}^d$, i.e., $\Lambda_i(\lambda) < +\infty$ for all $\lambda \in \mathbb{R}^d$, for each $i$.

Assumption 4 holds for arbitrary Gaussian and discrete random vectors, and also for many other
commonly used distributions; we refer the reader to [13] for examples of random vectors beyond
Examples 2 and 3 that have a finite log-moment generating function.

Properties of log-moment generating functions and their conjugates. For future reference, we list the 
properties that an arbitrary log-moment generating function A and its conjugate I satisfy; proofs can be 
found in [27, p. 8] and [26, p. 27, 35].


Lemma 5 (Properties of a log-moment generating function and its conjugate). Consider the log-moment
generating function $\Lambda$ and its conjugate $I$, associated with an arbitrary $d$-dimensional random vector $Z$.
Let $\theta = \mathbb{E}[Z]$. Then:

1) function $\Lambda$ satisfies:

a) $\Lambda(0) = 0$ and $\nabla\Lambda(0) = \theta$, when $0 \in \mathcal{D}_\Lambda^\circ$;

b) $\Lambda(\cdot)$ is lower semi-continuous and convex;

c) $\Lambda(\cdot)$ is $C^\infty$ on $\mathcal{D}_\Lambda^\circ$;

2) and function $I$ satisfies:

a) $I$ is nonnegative and $I(\theta) = 0$;

b) $I$ is lower semi-continuous and convex;

c) if $0 \in \mathcal{D}_\Lambda^\circ$, then $I$ has compact level sets;

d) $I$ is differentiable on $\mathcal{D}_I^\circ$.




We end this subsection by stating a simple but important property of the log-moment generating 
function that follows from its convexity and zero value at the origin. We note that the right-hand side of
inequality (11) was previously proven in [13] (for the case $d = 1$).

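Lemma 6. Let $\Lambda$ be an arbitrary log-moment generating function. For any $\alpha \in \Delta^{N-1}$ and $\lambda \in \mathbb{R}^d$,

$$N \Lambda\left( \frac{1}{N} \lambda \right) \le \sum_{i=1}^{N} \Lambda(\alpha_i \lambda) \le \Lambda(\lambda). \qquad (11)$$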

Proof: We first prove the right-hand side inequality in (11). (The proof is analogous to the proof of
the same inequality for the special case $d = 1$ in [13]; for completeness, we provide the proof here.) Fix
$\varsigma \in [0, 1]$. Then, by convexity of $\Lambda$ and the fact that $\Lambda(0) = 0$, we have

$$\Lambda(\varsigma\lambda) = \Lambda(\varsigma\lambda + (1 - \varsigma) 0) \le \varsigma \Lambda(\lambda) + (1 - \varsigma) \Lambda(0) = \varsigma \Lambda(\lambda).$$

Now, fix an arbitrary $\alpha \in \Delta^{N-1}$. Applying the preceding inequality for $\varsigma = \alpha_i$, for $i = 1, \ldots, N$, and
summing the resulting left- and right-hand sides over $i$ yields the claim, since $\sum_{i=1}^{N} \alpha_i = 1$.

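To prove the left-hand side inequality in (11), consider the function $g_\lambda : \mathbb{R}^N \to \mathbb{R}$, $g_\lambda(\beta) =
\sum_{i=1}^{N} \Lambda(\beta_i \lambda)$, for $\beta \in \mathbb{R}^N$. We prove the claim if we show that the minimum of $g_\lambda$ over the unit
simplex is attained at $\frac{1}{N} 1_N = (1/N, \ldots, 1/N) \in \Delta^{N-1}$. Since $g_\lambda$ is convex (being the sum
of convex functions), it suffices to show that there exists a Lagrange multiplier $\nu \in \mathbb{R}$ such that the pair
$(\frac{1}{N} 1_N, \nu)$ satisfies the Karush-Kuhn-Tucker (KKT) conditions [28]. To this end, define the Lagrangian
$L(\beta, \nu) = g_\lambda(\beta) + \nu (1_N^\top \beta - 1)$, for some $\nu \in \mathbb{R}$, $\beta \in \mathbb{R}^N$. We have

$$\partial_{\beta_i} L(\beta, \nu) = \lambda^\top \nabla\Lambda(\beta_i \lambda) + \nu.$$

Taking $\beta_i = 1/N$ and $\nu = -\lambda^\top \nabla\Lambda(\frac{1}{N}\lambda)$ proves the claim. ■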
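The two-sided bound of Lemma 6 can be illustrated numerically. The sketch below assumes a scalar Bernoulli observation with a hypothetical success probability `q`, whose log-moment generating function is $\log(1 - q + q e^{\lambda})$, and evaluates the three terms of the bound for a few hypothetical vectors in the simplex; the function names are illustrative.

```python
import math

def bernoulli_log_mgf(lam, q=0.3):
    """Log-moment generating function of a Bernoulli(q) observation."""
    return math.log(1.0 - q + q * math.exp(lam))

def lemma6_terms(log_mgf, alpha, lam):
    """Return (N * Lambda(lam / N), sum_i Lambda(alpha_i * lam), Lambda(lam))."""
    n = len(alpha)
    lower = n * log_mgf(lam / n)
    middle = sum(log_mgf(a * lam) for a in alpha)
    upper = log_mgf(lam)
    return lower, middle, upper

alphas = [
    [0.5, 0.3, 0.2],                  # a generic point of the simplex
    [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0],  # uniform weights: lower bound is tight
    [1.0, 0.0, 0.0],                  # a vertex: upper bound is tight
]
lams = [-4.0, -1.0, 0.5, 2.0, 6.0]

results = {(i, lam): lemma6_terms(bernoulli_log_mgf, alpha, lam)
           for i, alpha in enumerate(alphas) for lam in lams}
for key, (lower, middle, upper) in results.items():
    print(key, lower, middle, upper)
```

The uniform vector and the vertex are included because they are the two cases in which one of the inequalities holds with equality.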

B. Two extreme cases: isolation and fusion 

To set benchmarks for the performance of distributed inference (1), we consider two extreme cases of
the agents' cooperation: 1) complete agent's isolation, when an agent operates alone, making inferences
based on its own observations only; and 2) network-wide fusion, when each agent has access to all of
the observations. Mathematically, the states of agent $i$ corresponding to these two cases are as follows:
$X_{i,t}^{\mathrm{iso}} = \frac{1}{t} \sum_{s=1}^{t} Z_{i,s}$, for $i \in V$, for the case of isolated agents, obtained when in (1) $W_t = I_N$, and
$X_{t}^{\mathrm{cen}} = \frac{1}{Nt} \sum_{s=1}^{t} \sum_{i=1}^{N} Z_{i,s}$, for the case of the fusion center, obtained when $W_t = J_N$. Theorem 7
computes the corresponding large deviation rates, and it also asserts that, when the observations are i.i.d.,
the fusion-based rate scales linearly (with constant one) with the number of participating agents.
Theorem 7. Suppose that $Z_{i,t}$ are i.i.d. for all $i$ and $t$. Then,

1) for each $i$, the sequence $X_{i,t}^{\mathrm{iso}}$ satisfies the LDP with rate function $I_i^{\mathrm{iso}} = I$;

2) the sequence $X_t^{\mathrm{cen}}$ satisfies the LDP with rate function $I^{\mathrm{cen}} = NI$.


Clearly, by the strong law of large numbers, with both isolated nodes and fusion center, the corresponding
states $X_{i,t}^{\mathrm{iso}}$, $i = 1, \ldots, N$, and $X_t^{\mathrm{cen}}$ converge a.s. to $m := \mathbb{E}[Z_{i,t}]$.

Proof: Since $Z_{i,s}$ are i.i.d., part 1 follows by a direct application of Cramér's theorem [25], [26,
p. 36]. Turning to part 2, note that $X_t^{\mathrm{cen}}$ can be written as an average of i.i.d. samples $\frac{1}{N} \sum_{i=1}^{N} Z_{i,s}$,
$X_t^{\mathrm{cen}} = \frac{1}{t} \sum_{s=1}^{t} \frac{1}{N} \sum_{i=1}^{N} Z_{i,s}$. Thus, again by an application of Cramér's theorem [25], we see that to
prove part 2 it suffices to show that the conjugate of the log-moment generating function of $\frac{1}{N} \sum_{i=1}^{N} Z_{i,s}$
is $NI$. Computing the log-moment generating function of $\frac{1}{N} \sum_{i=1}^{N} Z_{i,s}$ at $\lambda \in \mathbb{R}^d$, we obtain:


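$$\log \mathbb{E}\left[ e^{\frac{1}{N} \sum_{i=1}^{N} \lambda^\top Z_{i,s}} \right] = \sum_{i=1}^{N} \log \mathbb{E}\left[ e^{\frac{1}{N} \lambda^\top Z_{i,s}} \right] = N \Lambda\left( \frac{1}{N} \lambda \right),$$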


where in the first equality we used the fact that the $Z_{i,s}$ are independent, for fixed $s$, and in the second
equality we used that they are identically distributed, with log-moment generating function $\Lambda$. Finally,
simple algebraic manipulations reveal that the conjugate of $N\Lambda(\lambda/N)$ equals $NI$: for any $x \in \mathbb{R}^d$,


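$$\sup_{\lambda \in \mathbb{R}^d} x^\top \lambda - N \Lambda\left( \frac{1}{N} \lambda \right) = N \left( \sup_{\lambda \in \mathbb{R}^d} x^\top \frac{\lambda}{N} - \Lambda\left( \frac{\lambda}{N} \right) \right) = N I(x). \qquad ■$$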
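The scaling of the fusion rate by $N$ can be checked numerically on a non-Gaussian example. The sketch below assumes scalar Bernoulli observations with a hypothetical parameter `q`, a hypothetical network size `n_agents`, and a ternary-search helper `conjugate` (an illustrative numerical method, not one from the paper) to evaluate both conjugates.

```python
import math

def conjugate(log_mgf, x, lo=-60.0, hi=60.0, iters=300):
    """Numerically evaluate sup_lam (x * lam - log_mgf(lam)) for scalar lam."""
    def f(lam):
        return x * lam - log_mgf(lam)
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if f(a) < f(b):
            lo = a
        else:
            hi = b
    return f((lo + hi) / 2.0)

q = 0.3          # hypothetical Bernoulli parameter
n_agents = 4     # hypothetical number of agents N

def log_mgf(lam):
    """Log-moment generating function of one Bernoulli(q) observation."""
    return math.log(1.0 - q + q * math.exp(lam))

def fusion_log_mgf(lam):
    """Log-moment generating function of the network-wide average: N * Lambda(lam / N)."""
    return n_agents * log_mgf(lam / n_agents)

x = 0.6
rate_iso = conjugate(log_mgf, x)
rate_cen = conjugate(fusion_log_mgf, x)
# For Bernoulli observations the conjugate is the binary relative entropy.
kl = x * math.log(x / q) + (1.0 - x) * math.log((1.0 - x) / (1.0 - q))
print(rate_iso, rate_cen, kl)
```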


Theorem 7 asserts that the rate function of any isolated agent $i$ is $I_i^{\mathrm{iso}} = I$, where $I$ is the conjugate
of the log-moment generating function of its observation, whereas the rate function of the network-wide
fusion is $N$ times higher, $I^{\mathrm{cen}} = NI$. Intuitively, for the general case of algorithm (1), we expect that the
rate function of a fixed agent $i$ should be between these two functions, $I$ and $NI$. It turns out that this is
indeed the case: Corollary 9 proves this for deterministic matrices, and Theorem 15 later in Section VI
confirms that this is true even for arbitrary (asymmetric) random matrices.

IV. Rate functions $I_i$ for deterministic weight matrices

This section considers deterministic weight matrices. The first result that we present, Theorem 8,
computes the rate functions $I_i$ for the case when the weight matrices at all times are equal to a stochastic
matrix $A$ such that $|\lambda_2(A)| < 1$. (This means that the underlying network has only one initial class.²)
We then focus on the special case when all observations are Gaussian (with possibly
different parameters across agents), and we calculate the rate functions in closed form. Further, we
formulate the problem of optimal network design and show that it can be efficiently solved by an SDP
when the observations are Gaussian.

² An initial class of a directed graph $G$ is any communication class of $G$ that has no incoming edges [29]. We also note that
initial classes of $G$ correspond to essential classes of the transpose of $G$ (the graph that results from reversing the directions of
edges in $G$) [30].


Theorem 8. Let $W_t = A$ for each $t$ and let Assumptions 1 and 4 hold. Suppose that $|\lambda_2(A)| < 1$ and let
$a$ denote the left eigenvector of $A$ corresponding to the eigenvalue 1. Then, for each $i$, $X_{i,t}$, $t = 1, 2, \ldots$
satisfies the LDP with the rate function $I_i = \bar{I}$, where $\bar{I}$ is the conjugate of

$$\bar{\Lambda}(\lambda) := \sum_{j=1}^{N} \Lambda_j(a_j \lambda), \quad \lambda \in \mathbb{R}^d.$$

Moreover, for each $i$, $X_{i,t}$ converges a.s. to $\bar{m} := \sum_{j=1}^{N} a_j m_j$.
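For scalar Gaussian observations, combining the theorem with the quadratic form of Example 3 gives $\bar{\Lambda}(\lambda) = \bar{m}\lambda + \frac{1}{2}\left(\sum_j a_j^2 \sigma_j^2\right)\lambda^2$, whose conjugate is $(x - \bar{m})^2 / \left(2 \sum_j a_j^2 \sigma_j^2\right)$. The sketch below evaluates this expression; it assumes $d = 1$, and the vector `a`, the means, and the variances are hypothetical values chosen for illustration.

```python
def gaussian_rate(x, a, means, variances):
    """Conjugate of sum_j Lambda_j(a_j * lam) for scalar Gaussian observations."""
    m_bar = sum(aj * mj for aj, mj in zip(a, means))
    v_bar = sum(aj * aj * vj for aj, vj in zip(a, variances))
    return (x - m_bar) ** 2 / (2.0 * v_bar)

def objective(lam, x, a, means, variances):
    """x * lam - sum_j Lambda_j(a_j * lam), with Lambda_j(u) = m_j u + 0.5 s_j^2 u^2."""
    total = sum(mj * aj * lam + 0.5 * vj * (aj * lam) ** 2
                for aj, mj, vj in zip(a, means, variances))
    return x * lam - total

a = [0.5, 0.25, 0.25]        # hypothetical left eigenvector (entries sum to one)
means = [1.0, 1.0, 1.0]      # equal means across agents
variances = [1.0, 2.0, 4.0]  # different variances across agents
x = 2.0

rate = gaussian_rate(x, a, means, variances)
# The maximizer of the concave objective is (x - m_bar) / v_bar.
lam_star = (x - 1.0) / 0.625
print(rate, objective(lam_star, x, a, means, variances))
```

The dependence on the weight matrix enters only through `a`, which is what makes the rate amenable to optimization over `a`.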


Proof: To prove the first part of the theorem, we apply the Gärtner-Ellis theorem [26]. Fix $i \in V$
and let $\Lambda_t(\lambda) := \frac{1}{t} \log \mathbb{E}\left[ e^{t \lambda^\top X_{i,t}} \right]$, for $\lambda \in \mathbb{R}^d$. Using that $Z_{j,t}$ are independent and that $\Phi(t, s) = A^{t-s+1}$
are constant, we obtain

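$$\Lambda_t(\lambda) = \frac{1}{t} \log \mathbb{E}\left[ e^{\sum_{s=1}^{t} \sum_{j=1}^{N} [A^{t-s+1}]_{ij} \lambda^\top Z_{j,s}} \right] = \frac{1}{t} \sum_{s=1}^{t} \sum_{j=1}^{N} \log \mathbb{E}\left[ e^{[A^{t-s+1}]_{ij} \lambda^\top Z_{j,s}} \right] = \frac{1}{t} \sum_{s=1}^{t} \sum_{j=1}^{N} \Lambda_j\left( [A^{t-s+1}]_{ij} \lambda \right) = \sum_{j=1}^{N} \frac{1}{t} \sum_{r=1}^{t} \Lambda_j\left( [A^{r}]_{ij} \lambda \right). \qquad (12)$$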

From $|\lambda_2(A)| < 1$ we have that $A^r \to 1_N a^\top$ as $r \to +\infty$ [32], and, hence, for any $i$, $[A^r]_{ij} \to a_j$. Consider
now a fixed $j$. Then, by continuity of $\Lambda_j$, $\Lambda_j([A^r]_{ij} \lambda) \to \Lambda_j(a_j \lambda)$, and hence the Cesàro averages must
converge to the same number:

$$\lim_{t \to +\infty} \frac{1}{t} \sum_{r=1}^{t} \Lambda_j\left( [A^r]_{ij} \lambda \right) = \Lambda_j(a_j \lambda).$$


Going back to (12) and taking the limit yields $\lim_{t \to +\infty} \Lambda_t(\lambda) = \sum_{j=1}^{N} \Lambda_j(a_j \lambda) = \bar{\Lambda}(\lambda)$. Thus, the conditions for
applying the Gärtner-Ellis theorem are fulfilled, and we have that, for each $i$, $X_{i,t}$ satisfies the large
deviations principle with the rate function equal to the conjugate of $\bar{\Lambda}$.

It remains to prove that $X_{i,t}$ at each $i$ converges to $\bar{m} = \sum_{j=1}^{N} a_j m_j$. First, note that $\bar{\Lambda}$ is in fact the
log-moment generating function of $\sum_{j=1}^{N} a_j Z_{j,t}$, for any $t$. This easily follows from the independence
of the $Z_{j,t}$'s, for $t$ fixed:

$$\log \mathbb{E}\left[ e^{\lambda^\top \sum_{j=1}^{N} a_j Z_{j,t}} \right] = \sum_{j=1}^{N} \log \mathbb{E}\left[ e^{a_j \lambda^\top Z_{j,t}} \right] = \sum_{j=1}^{N} \Lambda_j(a_j \lambda).$$

Thus, being a log-moment generating function, $\bar{\Lambda}$ satisfies the properties given in Lemma 5. In particular,
from the lower semicontinuity and convexity of $\bar{\Lambda}$ it follows that $\bar{\Lambda}$ and $\bar{I}$ are the conjugates of each
other. Invoking a classical result for conjugate functions, see, e.g., eq. (1.4.6) on p. 222 in [33], we have:

$$\mathrm{Argmin}\left\{ \bar{I}(x) : x \in \mathbb{R}^d \right\} = \partial\bar{\Lambda}(0), \qquad (13)$$

where, we recall, $\partial\bar{\Lambda}(0)$ denotes the subdifferential³ of $\bar{\Lambda}$ at $\lambda = 0$. We will show that $\partial\bar{\Lambda}(0)$ is a singleton
and that it equals $\{\bar{m}\}$. To do this, note that, by our assumption, $\mathcal{D}_{\Lambda_i} = \mathbb{R}^d$ for each $i$. Thus, $\mathcal{D}_{\bar{\Lambda}} = \mathbb{R}^d$.
In particular, $0 \in \mathcal{D}_{\bar{\Lambda}}^\circ$ and the claim follows by combining parts 1a and 1c of Lemma 5, and noting that
$\mathbb{E}\left[ \sum_{j=1}^{N} a_j Z_{j,t} \right] = \sum_{j=1}^{N} a_j m_j = \bar{m}$. We conclude that $\bar{I}(x) = 0$ if and only if $x = \bar{m}$.

We next use the previous conclusion together with convexity of $\bar{I}$ to show that, for any $\epsilon > 0$,

$$\inf_{x : \|x - \bar{m}\| \ge \epsilon} \bar{I}(x) > 0. \qquad (14)$$

First, since $\bar{I}$ is convex and it achieves its minimum at $\bar{m}$, it must be that $\bar{I}$ is nondecreasing along any
half-line that starts at $\bar{m}$. Hence, $\inf_{t \in [\epsilon, +\infty)} \bar{I}(\bar{m} + t d) = \bar{I}(\bar{m} + \epsilon d)$, for any $d$. This in particular implies
that $\inf_{x \in \mathbb{R}^d : \|x - \bar{m}\| \ge \epsilon} \bar{I}(x) = \inf_{x \in \mathbb{R}^d : \|x - \bar{m}\| = \epsilon} \bar{I}(x)$. To prove the claim in (14), we need to show that the
preceding infimum is strictly greater than zero. Since $\bar{I}$ is lower semi-continuous and the set under the
infimum, $\partial B_{\bar{m}}(\epsilon)$, is compact, it follows by the Weierstrass theorem that $\bar{I}$ attains a minimum on $\partial B_{\bar{m}}(\epsilon)$;
denote a minimizer by $x_\epsilon$. Recalling now that $\bar{I}(x) > 0$ for any $x \ne \bar{m}$, we conclude that it must be
that $\bar{I}(x_\epsilon) > 0$. This concludes the proof of the claim in (14).

Having (14), it is now easy to complete the proof of the second part of Theorem 8. Fix $i \in V$. From the upper bound of the LDP, proved in the first part, and (14), we have that for any $\epsilon > 0$:

$$\limsup_{t \to +\infty} \frac{1}{t} \log \mathbb{P}\left(\|X_{i,t} - \overline{m}\| \geq \epsilon\right) \leq -C_\epsilon < 0, \qquad (15)$$

where we denoted $C_\epsilon = \overline{I}(x_\epsilon)$. The previous inequality implies that for any $\delta > 0$ we can find a constant $K_\delta$ such that, for all $t$, $\mathbb{P}(\|X_{i,t} - \overline{m}\| \geq \epsilon) \leq K_\delta e^{-t(C_\epsilon - \delta)}$. Choosing for each $\epsilon$, $\delta = C_\epsilon/2$, we obtain exponential convergence of $X_{i,t}$ to $\overline{m}$. By the first Borel-Cantelli lemma [34], this in turn implies almost sure convergence of $X_{i,t}$ to $\overline{m}$. ■


¹ The subdifferential of a convex function $f : \mathbb{R}^d \to \mathbb{R}$ at a point $x \in \mathbb{R}^d$ is the set of all points $s \in \mathbb{R}^d$ such that, for all $y \in \mathbb{R}^d$, $f(y) \geq f(x) + s^\top (y - x)$.




Let $G$ denote the induced graph of $A$, i.e., $G = (V, E)$ where $E = \{(i, j) : A_{ji} > 0\}$; see, e.g., [35].

Corollary 9. When $Z_{i,t}$ are i.i.d. (identical agents), it holds

$$I \leq \overline{I} \leq N I, \qquad (16)$$

where $I$ is the conjugate of an agent's log-moment generating function $\Lambda = \Lambda_j$ and the inequalities in (16) hold in the pointwise sense. Moreover, the lower bound in (16) is attained whenever there exists a "leader" agent $i$ that satisfies $A_{ii} = 1$ and for any $j$ there is a (directed) path from $i$ to $j$ in the induced graph of $A$. The upper bound is attained when $A$ is doubly stochastic with positive diagonals and the induced graph of $A$ is strongly connected.

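Proof: When $Z_{i,t}$ are i.i.d.,

$$\overline{\Lambda}(\lambda) = \sum_{j=1}^N \Lambda(a_j \lambda). \qquad (17)$$

By Lemma 6 applied to $\alpha = a$ (note that $a$ is a stochastic vector), from eq. (17) we obtain

$$\lambda^\top x - \Lambda(\lambda) \leq \lambda^\top x - \overline{\Lambda}(\lambda) \leq \lambda^\top x - N\Lambda(\lambda/N).$$

Taking the supremum, the right-hand side inequality in (16) follows by the following simple manipulations: $\sup_{\lambda \in \mathbb{R}^d} \lambda^\top x - N\Lambda(\lambda/N) = N\left(\sup_{\lambda' \in \mathbb{R}^d} \lambda'^\top x - \Lambda(\lambda')\right) = N I(x)$. The left-hand side inequality in (16) is proven similarly. By Lemma 6, the log-moment generating function in (17) is upper bounded by $\Lambda$, and, by similar calculations as in the above, we get $\overline{I}(x) = \sup_{\lambda \in \mathbb{R}^d} \lambda^\top x - \overline{\Lambda}(\lambda) \geq \sup_{\lambda \in \mathbb{R}^d} \lambda^\top x - \Lambda(\lambda) = I(x)$.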

To prove the second part of Corollary 9, suppose that, for some $i$, $A_{ii} = 1$ and that in the induced graph of $A$ there is a path from $i$ to any node $j$. By Theorem 8, the claim is proven if we show that $\lambda_2(A) < 1$ and that $e_i$ is the left eigenvector of $A$ corresponding to the eigenvalue 1. Let $G$ denote the induced graph of $A$. To prove that $\lambda_2(A) < 1$, it suffices to show that $G$ has exactly one initial class and that this class is aperiodic [37]. Since $A$ is stochastic and $A_{ii} = 1$, we have $A_{ij} = 0$ for all $j \in V$, $j \neq i$. Thus, $\{i\}$ is an initial class of $G$. We next show that this is in fact the only initial class in $G$. Fix a node $j \neq i$ and let $C(j)$ denote the class of $G$ that $j$ belongs to. Note that $C(j)$ cannot contain $i$ (otherwise it would be possible to reach $i$ from $j$, which cannot be true because $A_{il} = 0$, for all $l \neq i$, and thus there are no edges pointing to $i$). Since (by our assumption) $j$ can be reached by a directed path from $i$, and, on the other hand, $i \notin C(j)$, there must be an edge pointing to $C(j)$. Hence, $C(j)$ is not an initial class of $G$. Repeating this for every $j$, we prove that there is no initial class of $G$ other than $\{i\}$. Finally, it is easy to see that $\{i\}$ is also aperiodic ($A_{ii} > 0$), hence proving that $\lambda_2(A) < 1$.
It only remains to verify that $e_i$ is the left eigenvector: since $A_{ii} = 1$ and $A$ is stochastic, the $i$-th row of $A$, $e_i^\top A$, equals $e_i^\top$. This completes the proof of the claim.

Suppose now that $A$ is doubly stochastic with positive diagonals and a strongly connected induced graph. Similarly as with the lower bound, by Theorem 8 it is sufficient to prove that $\frac{1}{N}\mathbb{1}_N$ is the left eigenvector of $A$ corresponding to the eigenvalue 1 and that $\lambda_2(A) < 1$. Since $A$ is doubly stochastic, it must be that $a^\top A = a^\top$ for $a = \frac{1}{N}\mathbb{1}_N$. Finally, since $A$ has positive diagonals and a strongly connected induced graph, we have that $A$ is irreducible and aperiodic, and hence $\lambda_2(A) < 1$ OTil (see also Corollary 8.4.8. in G2). This completes the proof of Corollary 9. ■
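The two extreme cases of Corollary 9 can be checked numerically. The sketch below is illustrative only: the two 4-node matrices are hypothetical examples, and the gain formula $1/\sum_j a_j^2$ is the i.i.d. Gaussian special case stated in Lemma 10, not a general identity.

```python
import numpy as np

def left_perron_vector(A):
    """Left eigenvector of a row-stochastic A for eigenvalue 1, scaled to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(A.T)   # right eigenvectors of A^T are left eigenvectors of A
    k = np.argmin(np.abs(eigvals - 1.0))    # pick the eigenvalue closest to 1
    a = np.real(eigvecs[:, k])
    return a / a.sum()                      # also fixes an arbitrary sign

def gaussian_rate_gain(a):
    """For i.i.d. Gaussian agents, bar I(x) = gain * I(x) with gain = 1 / sum_j a_j^2."""
    return 1.0 / np.sum(np.asarray(a) ** 2)

# Hypothetical "leader" matrix: node 0 ignores everyone (A_00 = 1) and reaches all nodes.
A_leader = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.5, 0.5, 0.0, 0.0],
                     [0.0, 0.5, 0.5, 0.0],
                     [0.0, 0.0, 0.5, 0.5]])

# Hypothetical doubly stochastic matrix on a ring with positive diagonal.
P = np.roll(np.eye(4), 1, axis=1)
A_ring = 0.5 * np.eye(4) + 0.25 * (P + P.T)

a_leader = left_perron_vector(A_leader)
a_ring = left_perron_vector(A_ring)
gain_leader = gaussian_rate_gain(a_leader)  # lower bound of (16): gain 1
gain_ring = gaussian_rate_gain(a_ring)      # upper bound of (16): gain N
```

The leader matrix yields $a = e_0$ and gain 1 (an isolated node), while the ring yields the uniform vector and gain $N$ (the fusion center).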

Rate $\overline{I}$ for Gaussian observations. Of special interest is the case when the observations $Z_{i,t}$ are all Gaussian. For this case, Lemma 10 gives a closed form expression for the rate function $\overline{I}$.

Lemma 10. Suppose that $Z_{j,t} \sim \mathcal{N}(m_j, S_j)$, for $j \in V$, where $S_j$, for each $j$, is a positive definite matrix. Function $\overline{I}$ from Theorem 8 is then given by

$$\overline{I}(x) = \frac{1}{2}(x - \overline{m})^\top \overline{S}^{-1}(x - \overline{m}), \qquad (18)$$

where $\overline{m} = \sum_{j=1}^N a_j m_j$ and $\overline{S} = \sum_{j=1}^N a_j^2 S_j$. In particular, when $m_j \equiv m$ and $S_j \equiv S$, $\overline{I}(x) = \frac{1}{\sum_{j=1}^N a_j^2} I(x)$, where $I(x)$ is the nodes' individual rate function given in Example |jj

Proof: Fix $x \in \mathbb{R}^d$ and recall that the log-moment generating function of a Gaussian vector of mean $m$ and covariance $S$ is $\lambda \mapsto \lambda^\top m + \frac{1}{2}\lambda^\top S \lambda$. Then $\overline{\Lambda}(\lambda) = \sum_{j=1}^N a_j \lambda^\top m_j + \frac{a_j^2}{2}\lambda^\top S_j \lambda$, and thus

$$\overline{I}(x) = \sup_{\lambda \in \mathbb{R}^d} \lambda^\top x - \sum_{j=1}^N a_j\left(\lambda^\top m_j + \frac{a_j}{2}\lambda^\top S_j \lambda\right). \qquad (19)$$

Since the function under the supremum is (strictly) concave, we obtain the optimizer $\lambda^\star$ from the first order optimality condition

$$x - \sum_{j=1}^N a_j m_j - \sum_{j=1}^N a_j^2 S_j \lambda = 0.$$

It follows that $\lambda^\star = \left(\sum_{j=1}^N a_j^2 S_j\right)^{-1}\left(x - \sum_{j=1}^N a_j m_j\right)$, which, when inserted in (19), yields the identity (18). ■
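As a sanity check of Lemma 10, the following sketch evaluates the closed form (18) and compares it with the objective of (19) at the optimizer $\lambda^\star$. The function name and the numerical data are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_rate(x, a, means, covs):
    """Return (closed form (18), objective of (19) at lambda*) for Gaussian observations."""
    a = np.asarray(a, dtype=float)
    m_bar = sum(aj * mj for aj, mj in zip(a, means))        # sum_j a_j m_j
    S_bar = sum(aj ** 2 * Sj for aj, Sj in zip(a, covs))    # sum_j a_j^2 S_j
    diff = x - m_bar
    lam_star = np.linalg.solve(S_bar, diff)                 # first order condition
    closed_form = 0.5 * diff @ lam_star
    objective = lam_star @ x - sum(
        aj * (lam_star @ mj) + 0.5 * aj ** 2 * (lam_star @ Sj @ lam_star)
        for aj, mj, Sj in zip(a, means, covs)
    )
    return closed_form, objective

# Hypothetical data: N = 3 agents, d = 2.
a = [0.5, 0.3, 0.2]
means = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])]
covs = [np.array([[1.0, 0.2], [0.2, 0.5]]),
        np.array([[0.3, 0.0], [0.0, 0.3]]),
        np.array([[2.0, -0.1], [-0.1, 1.0]])]
x = np.array([1.5, -0.5])
rate_closed, rate_objective = gaussian_rate(x, a, means, covs)

# Identical agents: the rate equals the individual rate scaled by 1 / sum_j a_j^2.
S = np.array([[1.0, 0.2], [0.2, 0.5]])
m = np.array([0.3, 0.7])
rate_identical, _ = gaussian_rate(x, a, [m] * 3, [S] * 3)
rate_single = 0.5 * (x - m) @ np.linalg.solve(S, x - m)
```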

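Remark 11. It is possible to determine $\overline{I}$ analytically even when the matrices $S_j$, $j = 1, ..., N$, and the vector $a$ are such that $\overline{S}$ is not invertible. It can be shown that the expression for $\overline{I}$ for this case is:

$$\overline{I}(x) = \begin{cases} \frac{1}{2}(x - \overline{m})^\top \overline{S}^{\dagger}(x - \overline{m}), & x - \overline{m} \in \mathcal{R}(\overline{S}) \\ +\infty, & \text{otherwise,} \end{cases}$$

where $\overline{S}^{\dagger}$ denotes the pseudoinverse and $\mathcal{R}(\overline{S})$ the range of $\overline{S}$.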



V. Network design 

From Theorem 8 and Corollary 9, it is clear that the performance of algorithm (1) critically depends on the choice of the weight matrix $A$, and in particular, on its left eigenvector $a$. We therefore pose the problem of optimizing $a$, for a fixed desired accuracy region $C$:

$$\begin{array}{ll} \text{maximize} & \inf_{x \in \mathbb{R}^d \setminus C} \overline{I}(x) \\ \text{subject to} & a \in \Delta^{N-1}, \end{array} \qquad (20)$$

where $\overline{I}$ is the rate function from Theorem 8. We denote by $a^\star_C$ and $I^\star_C$, respectively, an optimal solution and the optimal value of problem (20).

We exploit the analytical expression (18) for the rate function from Lemma 10 to show that, for Gaussian observations, problem (20) can be solved efficiently. We assume that all the nodes are observing the same set of physical quantities $\theta = (\theta_1, ..., \theta_d)^\top$, embedded in the local sensor noises. Hence, the observations $Z_{i,t}$ have the same expected value $\theta =: m_i = m$, across all nodes. We show in Lemma 12 that when $C$ is a ball, (20) can be formulated as an SDP.

Lemma 12. Consider the setup of Lemma 10 when $m_i \equiv m$. When the confidence set $C$ is a Euclidean ball of some arbitrary radius $\zeta > 0$ centered at $m$, $C = B_m(\zeta)$, the optimal solution of (20) is obtained by solving:

$$\begin{array}{ll} \text{minimize} & \gamma \\ \text{subject to} & \begin{bmatrix} \gamma I_d & \mathbb{1} S \\ S \mathbb{1}^\top & I_{Nd} \end{bmatrix} \succeq 0 \\ & a \in \Delta^{N-1}, \end{array} \qquad (21)$$

where $S \in \mathbb{R}^{Nd \times Nd}$ is a block diagonal matrix given by $S = \mathrm{diag}\left(a_1 S_1^{1/2}, ..., a_N S_N^{1/2}\right)$, and $\mathbb{1} = [I_d \; ... \; I_d] \in \mathbb{R}^{d \times Nd}$, where $I_d$ repeats $N$ times. Furthermore, $I^\star_C = \zeta^2/(2\gamma^\star)$, where $\gamma^\star$ is the optimum of (21).


Remark 13. Although problem (20) involves the expected value of the observations $m$ (which we do not know), it is clear from the equivalent reformulation (21) that, under the stated assumption, the knowledge of $m$ is not needed for discovering the optimal $a$ in (20). We also remark that, for the same assumptions, the solution of (21) does not depend on the particular accuracy $\zeta$: once (21) is solved, the same vector $a^\star_C$ applies for all $C = B_m(\zeta)$, $\zeta > 0$.

Remark 14. When the observations are one-dimensional ($d = 1$), it can be shown that the SDP in (21) reduces to a quadratic program (QP).

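Proof: We start by finding a closed form expression for the objective function $\inf_{x \in \mathbb{R}^d \setminus B_m(\zeta)} \overline{I}(x)$, for a given vector $a$. Similarly as in the proof of Theorem 8, it can be shown that for any $\zeta > 0$,

$$\inf_{x \in \mathbb{R}^d \setminus B_m(\zeta)} \overline{I}(x) = \inf_{x \in \mathbb{R}^d : \|x - m\| \geq \zeta} \overline{I}(x) = \min_{x \in \mathbb{R}^d : \|x - m\| = \zeta} \overline{I}(x).$$

It is easy to see that the latter problem can be reformulated as:

$$\min_{v \in \mathbb{R}^d : \|v\| = 1} \frac{\zeta^2}{2} v^\top \overline{S}^{-1} v = \frac{\zeta^2}{2} \cdot \frac{1}{\lambda_{\max}(\overline{S})}. \qquad (22)$$

Maximizing $1/\lambda_{\max}(\overline{S})$ corresponds to minimizing $\lambda_{\max}(\overline{S})$ and hence we obtain that (20) is equivalent to:

$$\begin{array}{ll} \text{minimize} & \lambda_{\max}\left(\sum_{j=1}^N a_j^2 S_j\right) \\ \text{subject to} & a \in \Delta^{N-1}, \end{array} \qquad (23)$$

where the optimal value of (20), $I^\star_C$, is obtained as $\zeta^2/(2\lambda^\star)$, where $\lambda^\star$ is the optimal value of (23). We next show that (23) can be recast in the SDP form (21). Introducing the epigraph variable $\gamma \in \mathbb{R}$ [28] yields the constraint $\sum_{j=1}^N a_j^2 S_j \preceq \gamma I_d$, which can be equivalently represented as $\gamma I_d - \mathbb{1} S I_{Nd} S \mathbb{1}^\top \succeq 0$. Since the identity matrix $I_{Nd}$ is positive definite, equivalence of (23) and (21) follows from the Schur complement theorem.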

VI. Universal bounds on the rate functions for general, random weight matrices 

We have seen in the previous section (Corollary 9) that, when the weight matrices $W_t$ are deterministic and constant, the states exhibit a very interesting and fundamental property: their large deviation probabilities $\mathbb{P}(X_{i,t} \in D)$ have rates that are always lower than the corresponding rate of the fusion center, and always higher than the corresponding rate of a node working in isolation. Theorem 15, which we present next, asserts that this property in fact holds not only for deterministic, but for arbitrary sequences of random weight matrices.

Theorem 15. Consider the distributed inference algorithm (1) under Assumptions 1 and 4 when $Z_{i,t}$ are i.i.d. (identical agents). For any measurable set $G \subseteq \mathbb{R}^d$, for each $i$:

$$-\inf_{x \in G^{\mathrm{o}}} N I(x) \leq \liminf_{t \to +\infty} \frac{1}{t} \log \mathbb{P}(X_{i,t} \in G) \qquad (24)$$

$$\leq \limsup_{t \to +\infty} \frac{1}{t} \log \mathbb{P}(X_{i,t} \in G) \leq -\inf_{x \in \overline{G}} I(x). \qquad (25)$$

Theorem 15 asserts that, no matter how we design the agents' interactions (represented by the weight matrices), in terms of large deviations performance, algorithm (1) can never be worse than when a node is working in isolation, but it also can never beat the fusion center. This result is important as it provides fundamental bounds for the large deviations performance of any algorithm of the form (1) that satisfies Assumptions 1 and 4 and processes i.i.d. observations. In the next two subsections we present the proofs of Theorem 15.


A. Proof of the upper bound 


Fix an arbitrary $i \in V$. To prove (24) for node $i$, it suffices to show that, for any closed set $F$,

$$\limsup_{t \to +\infty} \frac{1}{t} \log \mathbb{P}(X_{i,t} \in F) \leq -\inf_{x \in F} I(x). \qquad (26)$$

To see why this is true, note that, for an arbitrary measurable set $D$, there holds $\mathbb{P}(X_{i,t} \in D) \leq \mathbb{P}(X_{i,t} \in \overline{D})$. Applying (26) to the closed set $F = \overline{D}$ yields (24).

The proof of (26) consists of the following three steps.

Step 1: We use the exponential Markov inequality, together with conditioning on the matrices $W_1, ..., W_t$, to show that, for any measurable set $D \subseteq \mathbb{R}^d$ and any $\lambda \in \mathbb{R}^d$,

$$\frac{1}{t} \log \mathbb{P}(X_{i,t} \in D) \leq -\inf_{x \in D} \lambda^\top x + \Lambda(\lambda). \qquad (27)$$

Step 2: In the second step, we show that (27) is a sufficient condition for (26) to hold for all compact sets $F$. Lemma 16 formalizes this statement.

Lemma 16. Suppose that (27) holds for any measurable set $D \subseteq \mathbb{R}^d$. Then the inequality (26) holds for all compact sets $F$.

The proof of Lemma 16 uses the standard "finite cover" argument: for a compact set $F$, a finite number of balls forming a cover of $F$ is constructed, and then (27) is applied to each of the balls. The details of this derivation are given in Appendix A.

Step 3: So far, Steps 1 and 2 together imply that (26) holds for all compact sets. To extend (26) to all closed sets $F$, by a well known result from large deviations theory, Lemma 1.2.18 from [26], it suffices to show that the sequence of measures $\mu_{i,t} : \mathcal{B}(\mathbb{R}^d) \to [0,1]$, $\mu_{i,t}(D) := \mathbb{P}(X_{i,t} \in D)$, is exponentially tight. We prove this by considering the family of compact sets $H_\rho := [-\rho, \rho]^d$, with $\rho$ increasing to infinity. The result is given in Lemma 17 and the proof can be found in Appendix B.

Lemma 17. For every $i \in V$,

$$\lim_{\rho \to +\infty} \limsup_{t \to +\infty} \frac{1}{t} \log \mu_{i,t}(H_\rho^c) = -\infty. \qquad (28)$$

Hence, the sequence $\{\mu_{i,t}\}_{t=1,2,...}$ is exponentially tight.


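We now provide the details of Step 1.

Step 1. The proof of (27) is based on two key arguments: the exponential Markov inequality [34] and the right-hand side inequality of Lemma 6. For any measurable set $D \subseteq \mathbb{R}^d$ and any $\lambda \in \mathbb{R}^d$, by the exponential Markov inequality, we have

$$1_{\{X_{i,t} \in D\}} \leq e^{t\lambda^\top X_{i,t} - t \inf_{x \in D} \lambda^\top x}, \qquad (29)$$

which, after computing the expectation, yields

$$\mathbb{P}(X_{i,t} \in D) \leq e^{-t \inf_{x \in D} \lambda^\top x}\, \mathbb{E}\left[e^{t\lambda^\top X_{i,t}}\right]. \qquad (30)$$

We now focus on the right-hand side of (30). Conditioning on $W_1, ..., W_t$, the summands in (2) become independent, and using the fact that the $Z_{i,t}$'s are i.i.d. with the same log-moment generating function $\Lambda$, we obtain

$$\mathbb{E}\left[e^{t\lambda^\top X_{i,t}} \,\middle|\, W_1, ..., W_t\right] = e^{\sum_{s=1}^t \sum_{j=1}^N \Lambda\left([\Phi(t,s)]_{ij}\lambda\right)}. \qquad (31)$$

Applying now the right-hand side inequality of Lemma 6 to $\sum_{j=1}^N \Lambda([\Phi(t,s)]_{ij}\lambda)$ for each fixed $s$ (note that, for a fixed $s$, $[[\Phi(t,s)]_{i1}, ..., [\Phi(t,s)]_{iN}] \in \Delta^{N-1}$), it follows that the conditional expectation above is upper bounded by $e^{t\Lambda(\lambda)}$, i.e.,

$$\mathbb{E}\left[e^{t\lambda^\top X_{i,t}} \,\middle|\, W_1, ..., W_t\right] \leq e^{t\Lambda(\lambda)}, \qquad (32)$$

for any $\lambda \in \mathbb{R}^d$. Since in (32) $W_1, ..., W_t$ were arbitrary, taking the expectation, we get $\mathbb{E}\left[e^{t\lambda^\top X_{i,t}}\right] \leq e^{t\Lambda(\lambda)}$. Combining this with (30), we finally obtain

$$\frac{1}{t} \log \mathbb{P}(X_{i,t} \in D) \leq -\inf_{x \in D} \lambda^\top x + \Lambda(\lambda). \qquad (33)$$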
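To see the bound (33) at work in the simplest setting, the sketch below takes a scalar Gaussian sample mean (used here as a stand-in for an isolated node; this reading, and the numbers, are assumptions made for illustration). For $D = [m + \epsilon, +\infty)$ and $\Lambda(\lambda) = \lambda m + S\lambda^2/2$, optimizing (33) over $\lambda \geq 0$ gives $\lambda^\star = \epsilon/S$ and the exponent $-\epsilon^2/(2S)$, which is compared with the exact Gaussian tail.

```python
import math

def chernoff_bound(t, eps, var):
    """Bound from (33) with the optimal lambda* = eps / var: exp(-t eps^2 / (2 var))."""
    return math.exp(-t * eps ** 2 / (2.0 * var))

def exact_tail(t, eps, var):
    """P(sample mean of t i.i.d. N(m, var) variables exceeds m + eps), via erfc."""
    return 0.5 * math.erfc(eps * math.sqrt(t / (2.0 * var)))

eps, var = 0.1, 1.0
checks = [(t, exact_tail(t, eps, var), chernoff_bound(t, eps, var))
          for t in (1, 10, 100, 1000)]
# Per-sample exponents (1/t) log P approach the rate -eps^2 / (2 var) from below.
exponents = [math.log(p_exact) / t for t, p_exact, _ in checks]
```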


B. Proof of the lower bound 

We prove (25) following the general lines of the proof of the Gärtner-Ellis theorem lower bound, see [26]. However, as we will see later in this proof, we encounter several difficulties along the way that force us to depart from the standard Gärtner-Ellis method and use finer arguments. The main reason for this is that, in contrast with the setup of the Gärtner-Ellis theorem, the sequence of the (scaled) log-moment generating functions of $X_{i,t}$ (see ahead (35)) need not have a limit. Nevertheless, with the help of Lemma 6, we will be able to "sandwich" each member of this sequence between $\Lambda(\cdot)$ and $N\Lambda(\cdot/N)$. This is the key ingredient that allows us to derive (25). The proof is organized in the following four steps.

Step 1. In this step, we derive a sufficient condition, given in Lemma 18, for (25) to hold. Namely, to prove (25) for a given set $D$, it suffices to confine $X_{i,t}$ to a smaller region $B_x(\delta)$ within $D$, and show that, conditioned on any realization of the matrices $W_1, ..., W_t$, the rate of this event is at most $NI(x)$. Lemma 18 is proven by applying Fatou's lemma [34] to the sequence of random variables $R_t := \frac{1}{t} \log \mathbb{P}(X_{i,t} \in D \mid W_1, ..., W_t)$, and then combining the obtained result with the simple fact that, for every $x \in D^{\mathrm{o}}$ and all $\delta$ sufficiently small, $B_x(\delta) \subseteq D$. The proof is given in Appendix C.

Lemma 18. If for every $x \in \mathbb{R}^d$ and $\omega \in \Omega$,

$$\lim_{\delta \to 0} \liminf_{t \to +\infty} \frac{1}{t} \log \mathbb{P}\left(X_{i,t} \in B_x(\delta) \mid W_1, ..., W_t\right) \geq -NI(x), \qquad (34)$$

then (25) holds for all measurable sets $D$.


Step 2. To prove (34), we introduce the scaled log-moment generating function of $X_{i,t}$, under the conditioning on $W_1, ..., W_t$,

$$\Lambda_t(\lambda) := \frac{1}{t} \log \mathbb{E}\left[e^{t\lambda^\top X_{i,t}} \,\middle|\, W_1, ..., W_t\right]. \qquad (35)$$

It can be shown (similarly as in Step 1 of the proof of the upper bound) that, for any $\lambda \in \mathbb{R}^d$,

$$\Lambda_t(\lambda) = \frac{1}{t} \sum_{s=1}^t \sum_{j=1}^N \Lambda\left([\Phi(t,s)]_{ij}\lambda\right), \qquad (36)$$

where $\Phi(t,s) = W_t \cdots W_s$. Note that $\Lambda_t$ is convex and differentiable. However, $\Lambda_t$ is not necessarily 1-coercive [33], which is needed to show (34) for all points⁴ $x \in \mathbb{R}^d$. To overcome this, we introduce a small Gaussian noise to the states $X_{i,t}$ and define, for each $t$, $Y_{i,t} = X_{i,t} + V/\sqrt{Mt}$, where $V$ has the standard multivariate Gaussian distribution $\mathcal{N}(0_d, I_d)$, and, we assume, is independent of $Z_{j,t}$ and $W_t$, for all $j$ and $t$ (hence, $V$ is independent of $X_{i,t}$, for all $t$). The parameter $M > 0$ controls the magnitude of the noise, and the factor $1/\sqrt{t}$ adjusts the noise variance to the same level as the variance of $X_{i,t}$.

For each fixed $M$, let $\Lambda_{t,M}$ denote the log-moment generating function associated with the corresponding $Y_{i,t}$, under the conditioning on $W_1, ..., W_t$. It can be shown, using the independence of $V$ and $X_{i,t}$, that

$$\Lambda_{t,M}(\lambda) = \Lambda_t(\lambda) + \frac{\|\lambda\|^2}{2M}, \quad \lambda \in \mathbb{R}^d. \qquad (37)$$

Hence, the noise adds a (strictly) quadratic function to $\Lambda_t$, thus making $\Lambda_{t,M}$ 1-coercive, as proved in the following lemma. Lemma 19 gives the properties of $\Lambda_{t,M}$ that we use in the sequel; the proof is given in Appendix D.

⁴ More precisely, the problem arises when $x$ is not an exposed point of the conjugate $I_t$ of $\Lambda_t$, as will be clear from later parts of the proof (see also Exercise 2.3.20 in [26]).
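For completeness, identity (37) can be verified in one line. The derivation below is a sketch that assumes $\Lambda_{t,M}$ is scaled in the same way as $\Lambda_t$ in (35), and uses the moment generating function of a standard Gaussian vector, $\mathbb{E}[e^{u^\top V}] = e^{\|u\|^2/2}$:

```latex
\begin{align*}
\Lambda_{t,M}(\lambda)
  &= \frac{1}{t}\log \mathbb{E}\left[ e^{t\lambda^\top Y_{i,t}} \,\middle|\, W_1,\dots,W_t \right] \\
  &= \frac{1}{t}\log \mathbb{E}\left[ e^{t\lambda^\top X_{i,t}} \,\middle|\, W_1,\dots,W_t \right]
     + \frac{1}{t}\log \mathbb{E}\left[ e^{\sqrt{t/M}\,\lambda^\top V} \right] \\
  &= \Lambda_t(\lambda) + \frac{1}{t}\cdot\frac{t\,\|\lambda\|^2}{2M}
   = \Lambda_t(\lambda) + \frac{\|\lambda\|^2}{2M}.
\end{align*}
```

The second equality uses the independence of $V$ from $X_{i,t}$ and from the weight matrices, together with $t\lambda^\top V/\sqrt{Mt} = \sqrt{t/M}\,\lambda^\top V$.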


Lemma 19. 1) Function $\Lambda_{t,M}$ is convex, differentiable, and 1-coercive. Thus, for any $x \in \mathbb{R}^d$ there exists $\eta_t = \eta_t(x)$ such that $\nabla\Lambda_{t,M}(\eta_t) = x$.

2) Let $\theta = \mathbb{E}[Z_{i,t}]$. For any $x$, the corresponding sequence $\eta_t$, $t = 1, 2, ...$, is uniformly bounded, i.e.,

$$\|\eta_t\| \leq M\|x - \theta\|, \text{ for all } t. \qquad (38)$$

Using the results of Lemma 19, we prove in Step 3 the counterpart of (34) for the sequence $Y_{i,t}$ (see (39)), and in Step 4 we complete the proof of (25) by showing that (34) (a sufficient condition for (25)) is implied by (39).

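Step 3. We show that, for any fixed $x$, $M > 0$, and $\omega \in \Omega$,

$$\lim_{\delta \to 0} \liminf_{t \to +\infty} \frac{1}{t} \log \nu_{t,M}(B_x(\delta)) \geq -NI(x), \qquad (39)$$

where $\nu_{t,M}$ is the conditional probability measure induced by $Y_{i,t}$, $\nu_{t,M}(D) = \mathbb{P}(Y_{i,t} \in D \mid W_1, ..., W_t)$, $D \in \mathcal{B}(\mathbb{R}^d)$.

To this end, fix an arbitrary $x$, $\delta$, $M$, and $\omega$. We prove (39) by the change of measure argument. For any $t \geq 1$, we use the point $\eta_t$ from Lemma 19 to change the measure on $\mathbb{R}^d$ from $\nu_{t,M}$ to $\widetilde{\nu}_{t,M}$ by:

$$\frac{d\widetilde{\nu}_{t,M}}{d\nu_{t,M}}(z) = e^{t\eta_t^\top z - t\Lambda_{t,M}(\eta_t)}, \quad z \in \mathbb{R}^d. \qquad (40)$$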
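The effect of an exponential change of measure of the form (40) can be illustrated on a scalar Gaussian with $t = 1$. In the sketch below (an illustration with hypothetical numbers, not the measure $\nu_{t,M}$ itself), the tilting parameter $\eta$ solves $\Lambda'(\eta) = x$, and a plain Riemann sum confirms that the tilted density integrates to one and has mean $x$.

```python
import numpy as np

def tilted_mass_and_mean(m, s2, x, half_width=12.0, n=200001):
    """Tilt N(m, s2) by exp(eta z - Lambda(eta)) with Lambda'(eta) = x; return (mass, mean)."""
    eta = (x - m) / s2                          # solves m + s2 * eta = x
    log_mgf = eta * m + 0.5 * s2 * eta ** 2     # Lambda(eta) for a Gaussian
    sigma = np.sqrt(s2)
    z = np.linspace(min(m, x) - half_width * sigma,
                    max(m, x) + half_width * sigma, n)
    dz = z[1] - z[0]
    base = np.exp(-(z - m) ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    tilted = base * np.exp(eta * z - log_mgf)   # density of the tilted measure
    mass = np.sum(tilted) * dz                  # Riemann sum of the density
    mean = np.sum(z * tilted) * dz              # Riemann sum of z times the density
    return mass, mean

mass, mean = tilted_mass_and_mean(m=0.5, s2=0.25, x=1.2)
```

The tilted measure is centered at the point $x$ of interest, which is what makes the ball $B_x(\delta)$ a typical event under the new measure.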


Note that, in contrast with the standard method of the Gärtner-Ellis theorem, where the change of measure is fixed (once $x$ is given), here we have a different change of measure⁵,⁶ for each $t$. Expressing the probability $\nu_{t,M}(B_x(\delta))$ through $\widetilde{\nu}_{t,M}$, for each $t$, we get:

$$\begin{aligned} \frac{1}{t} \log \nu_{t,M}(B_x(\delta)) &= \Lambda_{t,M}(\eta_t) - \eta_t^\top x + \frac{1}{t} \log \int_{z \in B_x(\delta)} e^{t\eta_t^\top (x - z)}\, d\widetilde{\nu}_{t,M}(z) \\ &\geq \Lambda_{t,M}(\eta_t) - \eta_t^\top x - \delta\|\eta_t\| + \frac{1}{t} \log \widetilde{\nu}_{t,M}(B_x(\delta)). \end{aligned} \qquad (41)$$

We analyze separately each of the terms in (41). First, since $\eta_t$ is uniformly bounded, by Lemma 19 we immediately obtain that the third term vanishes:

$$\lim_{\delta \to 0} \liminf_{t \to +\infty} -\delta\|\eta_t\| \geq -\lim_{\delta \to 0} \delta M \|x - \theta\| = 0. \qquad (42)$$

⁵ The reason for this alteration of the standard method is the fact that our sequence of functions $\Lambda_{t,M}$ does not have a limit.
⁶ It can be shown that all distributions $\widetilde{\nu}_{t,M}$, $t \geq 1$, have the same expected value $x$; we do not pursue this result here, as it is not crucial for our goals.

We consider next the sum of the first two terms. Let $I_{t,M}$ denote the conjugate of $\Lambda_{t,M}$. By Lemma 19, we have that $\eta_t$ is the maximizer of $\lambda \mapsto \lambda^\top x - \Lambda_{t,M}(\lambda)$. Thus, the sum of the first two terms in (41) equals $-I_{t,M}(x) = \Lambda_{t,M}(\eta_t) - \eta_t^\top x$. Further, starting from the fact that $\Lambda_{t,M} \geq \Lambda_t$ and then invoking Lemma 6 (lower bound), we obtain:

$$I_{t,M}(x) \leq \sup_{\lambda \in \mathbb{R}^d} \lambda^\top x - \Lambda_t(\lambda) \leq \sup_{\lambda \in \mathbb{R}^d} \lambda^\top x - N\Lambda(\lambda/N) = NI(x), \qquad (43)$$

which holds for all $t \geq 1$ and all $M > 0$. Comparing with (39), we see that it only remains to show that the $\liminf$ as $t \to +\infty$ of the last term in (41) vanishes with $\delta$.

It is easy to show that the log-moment generating function associated with $\widetilde{\nu}_{t,M}$ is $\widetilde{\Lambda}_{t,M}(\lambda) := \Lambda_{t,M}(\lambda + \eta_t) - \Lambda_{t,M}(\eta_t)$. Let $\widetilde{I}_{t,M}$ denote the conjugate of $\widetilde{\Lambda}_{t,M}$. Similarly as in the proof of the upper bound in Section VI-A, it can be shown that

$$\frac{1}{t} \log \widetilde{\nu}_{t,M}(B_x^c(\delta)) \leq -\inf_{w \in B_x^c(\delta)} \widetilde{I}_{t,M}(w). \qquad (44)$$

The next lemma asserts that the right-hand side of (44) is strictly negative⁷ and uniformly bounded away from zero. The proof is given in Appendix E.

⁷ In the proof of the lower bound of the Gärtner-Ellis theorem, the sequence $\Lambda_t$ (our $\widetilde{\Lambda}_{t,M}$) has a limit $\Lambda$, and, because of this, it is sufficient to show that $-\inf_{w \in B_x^c(\delta)} I(w)$ is strictly negative, where $I$ is the conjugate of $\Lambda$. Here, since we do not have the limit of the $\widetilde{\Lambda}_{t,M}$'s, we need to prove that the latter holds for each function of the sequence $\widetilde{I}_{t,M}$, $t \geq 1$, and moreover, that the strict negativity does not "fade out" with $t$.

Lemma 20. For any $t$, there exists a minimizer $w_t = w_t(x, \delta)$ of the optimization problem $\inf_{w \in B_x^c(\delta)} \widetilde{I}_{t,M}(w)$. Furthermore, there exists $\xi = \xi(x, \delta) > 0$ such that

$$\widetilde{I}_{t,M}(w_t) \geq \xi, \text{ for all } t. \qquad (45)$$

Combining (44) and (45), we get

$$\widetilde{\nu}_{t,M}(B_x(\delta)) \geq 1 - e^{-t\xi}, \text{ for all } t,$$

which, together with the fact that $\widetilde{\nu}_{t,M}$ is a probability measure (and hence $\widetilde{\nu}_{t,M}(B_x(\delta)) \leq 1$), yields

$$\lim_{t \to +\infty} \frac{1}{t} \log \widetilde{\nu}_{t,M}(B_x(\delta)) = 0. \qquad (46)$$

Since (46) holds for all $\delta > 0$, we conclude that the last term in (41) vanishes after the appropriate limits have been taken. Summarizing (42), (43), and (46) finally proves (39).

Step 4. To complete the proof of (25), it only remains to show that (39) implies (34). Since $X_{i,t} = Y_{i,t} - V/\sqrt{tM}$, we have

$$\begin{aligned} \mathbb{P}\left(X_{i,t} \in B_x(2\delta) \mid W_1, ..., W_t\right) &\geq \mathbb{P}\left(Y_{i,t} \in B_x(\delta),\ V/\sqrt{tM} \in B_0(\delta) \mid W_1, ..., W_t\right) \\ &\geq \nu_{t,M}(B_x(\delta)) - \mathbb{P}\left(V/\sqrt{tM} \notin B_0(\delta)\right). \end{aligned} \qquad (47)$$

From (39), the rate for the probability of the first term in (47) is at most $NI(x)$. On the other hand, the probability that the norm of $V$ is greater than $\sqrt{tM}\delta$ decays exponentially with $t$ at the rate $M\delta^2/2$,

$$\lim_{t \to +\infty} \frac{1}{t} \log \mathbb{P}\left(V/\sqrt{tM} \notin B_0(\delta)\right) = -\frac{M\delta^2}{2}. \qquad (48)$$

Observe now that, for any fixed $\delta$, for all $M$ large enough so that $NI(x) < M\delta^2/2$, the exponential decay of the difference in (47) is determined by the rate of the first term, $NI(x)$. This finally establishes (34), which combined with Lemma 18 proves (25).
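The decay rate in (48) is easy to check numerically in the scalar case. The sketch below assumes $d = 1$, so that $\mathbb{P}(|V| \geq a) = \mathrm{erfc}(a/\sqrt{2})$; the values of $M$ and $\delta$ are hypothetical.

```python
import math

def noise_exit_exponent(t, M, delta):
    """(1/t) log P(|V| >= sqrt(t M) delta) for a scalar standard Gaussian V."""
    a = math.sqrt(t * M) * delta
    p = math.erfc(a / math.sqrt(2.0))   # two-sided Gaussian tail
    return math.log(p) / t

M, delta = 4.0, 0.5
limit_rate = -M * delta ** 2 / 2.0      # right-hand side of (48)
exponent_100 = noise_exit_exponent(100, M, delta)
exponent_1000 = noise_exit_exponent(1000, M, delta)
```

Values of $t$ much beyond a thousand would underflow `math.erfc` for these parameters, so the check is kept to moderate $t$.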


VII. Simulation results 

This section presents our simulation results for the performance of algorithm (1) for both deterministic and random weight matrices. In the deterministic case, we optimize the weight matrix $A$ by first optimizing its left eigenvector $a$; we then optimize $A$ such that it achieves the fastest averaging speed (see ahead (49)), subject to the condition on the obtained left eigenvector. We estimate by Monte Carlo simulations the corresponding estimation error. Simulations show that the optimized system significantly outperforms the system where the eigenvector $a$ is uniform and $A$ is doubly stochastic, hence demonstrating the benefit of network design. We then consider randomly time-varying weight matrices and verify by simulations Theorem 15 for the following cases: 1) $W_t$ are i.i.d. in time, with i.i.d. link failures; and 2) link failures of each link in the network, independently from other links, are governed by a Markov chain.

Fig. 1: Estimated error probabilities $\widehat{P}_{i,t}$ vs. number of iterations $t$, for each $i$. Left (deterministic model): dotted lines correspond to $W_{\mathrm{opt}}$, and full lines correspond to $W_{\mathrm{unif}}$. Middle and Right (random model): dashed curves correspond to the i.i.d. model, dotted curves to the Markov chain model, and full curves to an isolated node (upper) and the fusion center (lower); $p = 0.1$, $q_1 = q_2 = 0.3$ (middle) and $p = 0.5$, $q_1 = 0.7$, $q_2 = 0.1$ (right).

Simulation setup. The number of nodes is $N = 10$. The communication graph $G$ is formed by placing the nodes uniformly at random in a unit square and forming the (bidirectional) links between the nodes that are within distance $r = 0.4$ from each other. Observations $Z_{i,t}$ are Gaussian, for each $i$, with the same expected value $m_i = m$ chosen uniformly at random from the $[0,1]$ interval. In the deterministic case, the variances $S_i = \sigma_i^2$ are different across nodes, whereas in the random case all the nodes have the same variance $S = \sigma^2$. The quantities $S_i$, $i = 1, ..., N$, and $S$ are chosen uniformly at random, and, in the deterministic case, independently for each $i$, in the $[0,1]$ interval.


A. Network design for the deterministic case 

In this section, we consider the deterministic case, when the weight matrix is constant at all times, and when the observations are scalar ($d = 1$). Since all the nodes have the same expected value $m$, we have that $\overline{m} = m$ (see Theorem 8), and thus all the states $X_{i,t}$ converge (a.s.) to $m$. We wish to find the weight matrix $A = A_{\mathrm{opt}}$ that achieves this convergence with the highest possible rate function $\overline{I}$. The accuracy region that we target is $C = [m - \zeta, m + \zeta]$, where we set $\zeta = 0.035$.

We obtain $A_{\mathrm{opt}}$ as follows. We first solve problem (21) via CVX [36], [37] to obtain the optimal left eigenvector $a_{\mathrm{opt}}$ of $A_{\mathrm{opt}}$. Then, we optimize $A$ by minimizing the spectral norm of $A - \mathbb{1}_N a_{\mathrm{opt}}^\top$, while respecting the sparsity pattern dictated by the communication graph $G$, as in [38]. Hence, $A_{\mathrm{opt}}$ is obtained as the solution of the following optimization problem:

$$\begin{array}{ll} \text{minimize} & \|A - \mathbb{1}_N a_{\mathrm{opt}}^\top\| \\ \text{subject to} & A\mathbb{1}_N = \mathbb{1}_N \\ & a_{\mathrm{opt}}^\top A = a_{\mathrm{opt}}^\top \\ & A \in \mathcal{A}, \end{array} \qquad (49)$$

where $\mathcal{A} := \left\{A \in \mathbb{R}_+^{N \times N} : A_{ij} = 0 \text{ if } (i,j) \notin G,\ i, j \in V\right\}$; see also Section 7.3 in [38]. Note that the rate function depends on $A$ only through its left eigenvector $a$, but the significance of $\|A - \mathbb{1}_N a^\top\|$ is in the finite time performance (i.e., the vertical shift of the curves in Figure 1 (left) further ahead). For the purpose of comparison, we also solve problem (49) when $a_{\mathrm{opt}}$ is replaced by $a_{\mathrm{unif}} = \frac{1}{N}\mathbb{1}_N$; we denote the corresponding solution by $A_{\mathrm{unif}}$ ($A_{\mathrm{unif}}$ hence represents the doubly stochastic matrix with the fastest averaging on the same topology $G$ as $A_{\mathrm{opt}}$).

At each node $i$ and each time $t$, we estimate the probability of error $\widehat{P}_{i,t}$ by Monte Carlo simulations: we count the number of times that the state of node $i$ at time $t$, $X_{i,t}$, falls outside of the accuracy region $C$, and then we divide this number by the number of simulation runs $K = 1000000$, $\widehat{P}_{i,t} = \frac{1}{K}\sum_{k=1}^K 1_{\{X^k_{i,t} \notin C\}}$. We do this both for the case when algorithm (1) runs with the weight matrix $A_{\mathrm{opt}}$ and when it runs with the weight matrix $A_{\mathrm{unif}}$.
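The Monte Carlo estimate above can be sketched as follows. The recursion used for the states, $X_t = A\left(\frac{t-1}{t}X_{t-1} + \frac{1}{t}Z_t\right)$, is an assumption: it is chosen because it unrolls to $X_{i,t} = \frac{1}{t}\sum_{s}\sum_{j}[A^{t-s+1}]_{ij}Z_{j,s}$, consistent with (36) for a constant weight matrix. The function name and the parameter values in the usage note are hypothetical.

```python
import numpy as np

def estimate_error_probabilities(A, m, variances, zeta, T, K, seed=0):
    """Monte Carlo estimate of P(|X_{i,t} - m| > zeta) for t = 1..T; returns a T x N array."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    std = np.sqrt(np.asarray(variances, dtype=float))
    X = np.zeros((K, N))                       # one row of node states per simulation run
    P_hat = np.zeros((T, N))
    for t in range(1, T + 1):
        Z = m + std * rng.standard_normal((K, N))      # fresh scalar Gaussian observations
        X = ((t - 1) / t * X + Z / t) @ A.T            # assumed recursion, applied row-wise
        P_hat[t - 1] = np.mean(np.abs(X - m) > zeta, axis=0)
    return P_hat
```

For example, `estimate_error_probabilities(np.full((5, 5), 0.2), 0.5, np.ones(5), 0.5, T=40, K=2000)` simulates five nodes with uniform averaging; plotting the logarithm of each column against $t$ gives curves whose slope approximates the rate. Far fewer runs than the $K = 10^6$ used in the paper suffice only for moderately small error probabilities.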

The leftmost plot in Figure 1 shows the evolution of the error probability over iterations, in the log-scale, for each node $i$; dotted lines correspond to $A_{\mathrm{opt}}$ while full lines correspond to $A_{\mathrm{unif}}$. We can see from Figure 1 (left) that for both $A_{\mathrm{opt}}$ and $A_{\mathrm{unif}}$ the curves at all nodes have the same slope, equal to the value of the corresponding rate function over the set $C$. For the same weight matrix, the vertical shift in different curves (that correspond to different nodes) is due to the difference in the observation parameters (intuitively, nodes with higher variances $\sigma_i^2$ need more time to filter out the noise, and thus their error probability curves are shifted upwards), and the placement in the network (nodes with more central location in the network converge faster). We can see that the algorithm with the optimized left eigenvector achieves a much higher large deviations rate than the one with the uniform eigenvector, as predicted by our theory. For example, for the target error probability of $e^{-5} \approx 0.007$, the optimized system requires around 140 iterations on average (across nodes), while the system with the uniform vector $a$ needs around 250 iterations for the same accuracy. The reason for this behavior is quite intuitive: optimizing the vector $a$ corresponds to choosing different weights for different sensors depending on their local variances (i.e., covariance matrices, when $d > 1$).


B. Random weight matrices 

This subsection considers random weight matrices $W_t$ for two cases: i.i.d. link failures and Markov chain link failures. With the i.i.d. model, each directed link $(i, j) \in E$ can fail with probability $1 - p$ at any given time $t$; this occurs independently from other link failures and independently from past times. With the Markov chain model, each link $(i, j) \in G$ behaves as a Markov chain, independent from the Markov chains of other links, such that with probability $q_1$ the link stays online, if it was online in the previous time slot, and with probability $q_2$ stays offline. (For example, if at time $t$ a link is online, then at time $t + 1$ this link stays online with probability $q_1$ and fails with probability $1 - q_1$.) With both the i.i.d. and the Markov chain model, the weight matrix at time $t$ equals $W_t = I_N - \alpha L_t$, where $L_t$ is the Laplacian of the (directed) topology realization at time $t$, $\alpha = 1/(d_{\max} + 1)$, and $d_{\max}$ is the maximal degree in $G$.
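A minimal sketch of how one realization of $W_t$ could be generated under the i.i.d. link failure model is given below. The convention that `online[i, j] = 1` means node $i$ currently receives from node $j$, and that the Laplacian is built from row sums, is an assumption; the 4-node graph is hypothetical.

```python
import numpy as np

def sample_iid_links(G, p, rng):
    """Keep each directed link of G online independently with probability p."""
    return G * (rng.random(G.shape) < p)

def weight_matrix(online, d_max):
    """W = I - alpha * L, with L the Laplacian of the realized directed topology."""
    online = np.asarray(online, dtype=float)
    L = np.diag(online.sum(axis=1)) - online    # row-sum (in-neighbor) Laplacian, assumed convention
    alpha = 1.0 / (d_max + 1)
    return np.eye(online.shape[0]) - alpha * L

# Hypothetical bidirectional communication graph on 4 nodes (no self-loops).
G = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
d_max = int(G.sum(axis=1).max())
rng = np.random.default_rng(1)
W = weight_matrix(sample_iid_links(G, p=0.5, rng=rng), d_max)
W_full = weight_matrix(G, d_max)                # realization with no link failures
```

Because $\alpha \, d_{\max} < 1$, every realization is row stochastic with a positive diagonal, regardless of which links fail.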

The middle and the right plot in Figure 1 show the estimated error probabilities versus the number of iterations for both the i.i.d. and the Markov chain model, for two different sets of parameters: $p = 0.1$, $q_1 = q_2 = 0.3$ (middle) and $p = 0.5$, $q_1 = 0.7$, $q_2 = 0.1$ (right). Both simulations are obtained for the same value of accuracy $\zeta = 0.1$, and one-dimensional Gaussian observations with parameters $m$ and $S = \sigma^2$ chosen uniformly at random from the $[0,1]$ interval. The results for the i.i.d. model are plotted in dashed lines, while the results for the Markov chain model are plotted in dotted lines. For reference, we also plot the estimated error probabilities for perfect fusion and isolation (full lines), see Section III-B; the lower curve corresponds to fusion. We can see from the plots that, under both models, the rate at which the error probability at each node decays is between the decay rate of the isolated node and fusion center curves, as predicted by Theorem 15. We can also see that the agents' decay rates for the Markov chain model are faster than the ones for the i.i.d. model. This is expected since, for both sets of parameters, links in the i.i.d. model are online less frequently than the links in the Markov chain model, once the system reaches a stationary regime. Also, we see that improvements in the system parameters (higher $p$, in the i.i.d. model, and higher $q_1$ and lower $q_2$ in the Markov chain model) significantly affect the large deviations rates: in the right plot, the rates at each node got closer to the optimal, fusion center rate.
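The claim that links are online less often under the i.i.d. model can be checked from the stationary distribution of the two-state link chain. The sketch below assumes the chain described above (online stays online with probability $q_1$, offline stays offline with probability $q_2$) and solves the balance equation $\pi_{\mathrm{on}}(1 - q_1) = \pi_{\mathrm{off}}(1 - q_2)$; the function name is hypothetical.

```python
def stationary_online_probability(q1, q2):
    """Stationary probability that a link is online in the two-state Markov chain model."""
    return (1.0 - q2) / ((1.0 - q1) + (1.0 - q2))

# Parameter sets used for the middle and right plots of Figure 1.
pi_middle = stationary_online_probability(0.3, 0.3)   # compare with p = 0.1
pi_right = stationary_online_probability(0.7, 0.1)    # compare with p = 0.5
```

For the middle plot the Markov chain links are online half of the time versus $p = 0.1$, and for the right plot three quarters of the time versus $p = 0.5$.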


VIII. Conclusion 

We studied large deviations rates for consensus-based distributed inference, for deterministic and random asymmetric weight matrices. For the deterministic case, we characterized the corresponding large deviations rate function, and we showed that it depends on the weight matrix only through its left eigenvector that corresponds to its unit eigenvalue. When the observations are Gaussian (not necessarily identically distributed across agents), the rate function has a closed form expression. Motivated by these insights, we formulated the optimal weight matrix design problem and showed that, in the Gaussian case, it can be formulated as an SDP and hence efficiently solved. When the weight matrices are random, we proved that the rate function of any node in the network lies between the rate functions corresponding to a fusion node, which processes all observations, and a node in isolation. The bounds hold for any random model of weight matrices, with the single condition that the weight matrices are independent from the agents' observations.


Appendix A 
Proof of Lemma[T61 

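For every $\delta > 0$, define $I^\delta : \mathbb{R}^d \to \mathbb{R}$, $I^\delta(x) := \min\{I(x) - \delta, 1/\delta\}$, and note that for any $D \subseteq \mathbb{R}^d$, 

$$\lim_{\delta \to 0} \inf_{x \in D} I^\delta(x) = \inf_{x \in D} I(x). \tag{50}$$

Fix a compact set $F$. For every $y \in F$, choose $\lambda_y \in \mathbb{R}^d$ for which $\lambda_y^\top y - \Lambda(\lambda_y) \geq I^\delta(y)$.$^5$ Also, for 
each $y$ choose $\rho_y > 0$ such that $\rho_y \|\lambda_y\| \leq \delta$. 

Now, fix arbitrary $y \in F$. Then, by construction of $\rho_y$ and $\lambda_y$, we have: 

$$- \inf_{x \in B_y(\rho_y)} \lambda_y^\top x \leq -\lambda_y^\top y + \delta.$$

Applying (27) for $D = B_y(\rho_y)$ and $\lambda = \lambda_y$ and combining it with the preceding equation yields 

$$\frac{1}{t} \log \mathbb{P}\left(X_{i,t} \in B_y(\rho_y)\right) \leq \delta - \lambda_y^\top y + \Lambda(\lambda_y). \tag{51}$$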


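Extracting a finite cover $\{B_{y_i}(\rho_{y_i}) : i = 1, \ldots, K\}$ of $F$ from the family of balls $\{B_y(\rho_y) : y \in F\}$, and 
applying (51) to each of the balls, we obtain by the union bound 

$$\frac{1}{t} \log \mathbb{P}\left(X_{i,t} \in F\right) \leq \frac{1}{t} \log K + \delta - \min_{i=1,\ldots,K} \left[\lambda_{y_i}^\top y_i - \Lambda(\lambda_{y_i})\right].$$

Recalling that for each $y$, $\lambda_y$ satisfies $\lambda_y^\top y - \Lambda(\lambda_y) \geq I^\delta(y)$, and letting $t \to +\infty$, 

$$\limsup_{t \to +\infty} \frac{1}{t} \log \mathbb{P}\left(X_{i,t} \in F\right) \leq \delta - \min_{i=1,\ldots,K} I^\delta(y_i) \leq \delta - \inf_{y \in F} I^\delta(y).$$

Finally, letting $\delta \to 0$ and using the property (50) of $I^\delta$, the bound (26) for compact sets follows. 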
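
The role of the truncated function $I^\delta$ in property (50) can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the scalar standard Gaussian rate $I(x) = x^2/2$ and a grid on the set $D = [1, 3]$, neither of which comes from the paper, and the function names are hypothetical.

```python
def rate(x):
    """Assumed example: rate function of a scalar standard Gaussian."""
    return 0.5 * x * x


def truncated_rate(x, delta):
    """I^delta(x) = min{I(x) - delta, 1/delta}."""
    return min(rate(x) - delta, 1.0 / delta)


def inf_over_grid(f, lo=1.0, hi=3.0, steps=2001):
    """Infimum of f over a uniform grid on [lo, hi] (stand-in for the set D)."""
    return min(f(lo + (hi - lo) * k / (steps - 1)) for k in range(steps))


inf_rate = inf_over_grid(rate)
# The truncated infima approach inf_rate as delta shrinks.
truncated_infima = {
    delta: inf_over_grid(lambda x, d=delta: truncated_rate(x, d))
    for delta in (1.0, 0.1, 0.01, 0.001)
}
```

On this example the infimum is attained at the left endpoint, so the truncated infimum equals the true one minus $\delta$ for every $\delta$ in the list.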


$^5$ Such a point must exist because of the following: If $I(y)$ is finite, then, since $I(y)$ equals the supremum $\sup_{\lambda \in \mathbb{R}^d} \lambda^\top y - \Lambda(\lambda)$, 
for every $\delta > 0$, there must exist a point $\lambda' = \lambda'(\delta)$ such that $\lambda'^\top y - \Lambda(\lambda') \geq I(y) - \delta$. Since $I(y) - \delta \geq I^\delta(y)$, taking $\lambda_y$ to 
be $\lambda'(\delta)$ verifies the inequality. We can show in a similar way the existence of $\lambda_y$ in the case when $I(y) = +\infty$. 

Appendix B 
Proof of Exponential Tightness of $\{\mu_{i,t}\}_{t=1,2,\ldots}$ 

This section proves Lemma 17. Fix $i$ and, for each $t$ and $l$, $l = 1, \ldots, d$, define $\mu^l_{i,t}$ to be the probability 
measure on $\mathbb{R}$ induced by the $l$-th coordinate of vector $X_{i,t}$, 

$$\mu^l_{i,t}((-\infty, \rho]) := \mathbb{P}\left(X^l_{i,t} \leq \rho\right),$$

for $\rho \in \mathbb{R}$. For each $l$ let $\Lambda^l$ denote the log-moment generating function of $Z^l_{i,t}$, $\Lambda^l(\nu) := \log \mathbb{E}\left[e^{\nu Z^l_{i,t}}\right]$, 
$\nu \in \mathbb{R}$; note that $\Lambda^l(\nu) = \Lambda(\nu e_l)$. Also, let $I^l$ denote the conjugate of $\Lambda^l$, 

$$I^l(\rho) = \sup_{\nu \in \mathbb{R}} \rho\nu - \Lambda^l(\nu). \tag{52}$$


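Now, fix $\rho > 0$. By the union bound, we have 

$$\mu_{i,t}\left(([-\rho, \rho]^d)^c\right) \leq \sum_{l=1}^{d} \mu^l_{i,t}((-\infty, -\rho]) + \sum_{l=1}^{d} \mu^l_{i,t}([\rho, +\infty)). \tag{53}$$

We focus on the term in the right-hand side sum that corresponds to a fixed $l$. For any fixed $\nu > 0$, we 
have 

$$\mu^l_{i,t}([\rho, +\infty)) = \mathbb{P}\left(X^l_{i,t} \geq \rho\right) \leq \mathbb{E}\left[e^{t\nu X^l_{i,t} - t\rho\nu}\right].$$

Similarly as in the earlier derivation, conditioning on $W_1, \ldots, W_t$, we obtain: 

$$\mathbb{E}\left[e^{t\nu X^l_{i,t}} \,\middle|\, W_1, \ldots, W_t\right] = \mathbb{E}\left[e^{\sum_{s=1}^{t} \sum_{j=1}^{N} \nu e_l^\top [\Phi(t,s)]_{ij} Z_{j,s}} \,\middle|\, W_1, \ldots, W_t\right]$$
$$= e^{\sum_{s=1}^{t} \sum_{j=1}^{N} \Lambda^l([\Phi(t,s)]_{ij}\nu)} = e^{\sum_{s=1}^{t} \sum_{j=1}^{N} \Lambda([\Phi(t,s)]_{ij}\nu e_l)},$$

where the second equality follows by the fact that, given $W_1, \ldots, W_t$, the terms $\nu e_l^\top [\Phi(t,s)]_{ij} Z_{j,s}$ in the 
double sum above are independent. Applying now Lemma 6 for $\lambda = \nu e_l$, and using the fact that $\Lambda^l(\nu) = \Lambda(\nu e_l)$, yields 

$$\mathbb{E}\left[e^{t\nu X^l_{i,t}} \,\middle|\, W_1, \ldots, W_t\right] \leq e^{t\Lambda^l(\nu)}.$$

Combining the preceding three equations together with the monotonicity of the expectation, we obtain 

$$\frac{1}{t} \log \mathbb{P}\left(X^l_{i,t} \geq \rho\right) \leq \Lambda^l(\nu) - \rho\nu. \tag{54}$$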

We show that if $\rho \geq \theta_l = e_l^\top \theta$, the infimum of the right-hand side of (54) over all $\nu \geq 0$ equals $-I^l(\rho)$. 
To prove this, it suffices to show that if $\rho \geq \theta_l$, the supremum in (52) is not achieved for negative values 
of $\nu$. Function $\Lambda^l$ is convex and differentiable for all $\nu$, and in particular at $\nu = 0$ (as a log-moment 
generating function, see Lemma 1). Thus, for any $\nu$, $\Lambda^l(\nu) \geq \Lambda^l(0) + (\Lambda^l)'(0)\nu = \theta_l \nu$. Thus, for $\nu < 0$, 
we have $\rho\nu - \Lambda^l(\nu) \leq \nu(\rho - \theta_l) \leq 0$. Since we know that $I^l$ must be non-negative (see Lemma 1), the 
claim above follows. Thus, for all $\rho \geq \theta_l$, we have: 

$$\frac{1}{t} \log \mu^l_{i,t}([\rho, +\infty)) \leq -I^l(\rho). \tag{55}$$

By a similar procedure, one can also obtain that $\frac{1}{t} \log \mu^l_{i,t}((-\infty, -\rho]) \leq -I^l(-\rho)$. 

Now, recall that by Assumption 4, $\mathcal{D}_\Lambda = \mathbb{R}^d$; hence, $\mathcal{D}_{\Lambda^l} = \mathbb{R}$. Then, for any $\rho$ 

$$I^l(\rho) = \sup_{\nu \in \mathbb{R}} \nu\rho - \Lambda^l(\nu) \geq \nu_0 |\rho| - \max_{\nu \in \{-\nu_0, \nu_0\}} \Lambda^l(\nu),$$

where $\nu_0$ is an arbitrary positive number. Noting that the second term on the right-hand side is finite, we 
see that $I^l$ grows unbounded as $|\rho| \to +\infty$. Since $l$ was arbitrary, we have that each of the exponents 
in the preceding bounds grows unbounded as $\rho$ increases to $+\infty$. This completes the proof of Lemma 17. 
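
The optimization step behind (54) and (55) can be illustrated in a case where everything is available in closed form. The following sketch assumes a scalar Gaussian coordinate with mean `theta` and standard deviation `sigma` (an assumption made only for this example), for which the log-moment generating function and its conjugate are known; all function names are hypothetical.

```python
def log_mgf(nu, theta=1.0, sigma=2.0):
    """Log-moment generating function of N(theta, sigma^2)."""
    return theta * nu + 0.5 * sigma ** 2 * nu ** 2


def rate(rho, theta=1.0, sigma=2.0):
    """Conjugate of log_mgf: (rho - theta)^2 / (2 sigma^2)."""
    return (rho - theta) ** 2 / (2.0 * sigma ** 2)


def chernoff_exponent(rho, theta=1.0, sigma=2.0, nu_max=10.0, steps=100001):
    """Grid approximation of inf over nu >= 0 of log_mgf(nu) - rho * nu."""
    best = float("inf")
    for k in range(steps):
        nu = nu_max * k / (steps - 1)
        best = min(best, log_mgf(nu, theta, sigma) - rho * nu)
    return best


above_mean = chernoff_exponent(3.0)   # rho >= theta: matches -rate(3.0)
below_mean = chernoff_exponent(0.0)   # rho < theta: infimum sits at nu = 0
```

The second call shows why the condition $\rho \geq \theta_l$ matters: below the mean, restricting to $\nu \geq 0$ gives the trivial exponent 0 rather than $-I^l(\rho)$.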


Appendix C 
Proof of Lemma 18 


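Fix a measurable set $D$. We first show that if (34) holds for any $x \in D^\circ$ and any $\omega \in \Omega$, then for any 
$x \in D^\circ$ 

$$\lim_{\delta \to 0} \liminf_{t \to +\infty} \frac{1}{t} \log \mathbb{P}\left(X_{i,t} \in B_x(\delta)\right) \geq -N I(x). \tag{56}$$

To this end, fix $x \in D^\circ$ and fix $\omega \in \Omega$. 

Applying Fatou’s lemma [34] to the sequence of random variables $R_t := \frac{1}{t} \log \mathbb{P}\left(X_{i,t} \in D \mid W_1, \ldots, W_t\right)$, 
$t = 1, 2, \ldots$, we get 

$$\liminf_{t \to +\infty} \mathbb{E}\left[\frac{1}{t} \log \mathbb{P}\left(X_{i,t} \in D \mid W_1, \ldots, W_t\right)\right] \geq \mathbb{E}\left[R^\star\right], \tag{57}$$

where $R^\star(\omega) := \liminf_{t \to +\infty} R_t(\omega)$, $\omega \in \Omega$. Consider the left-hand side of (57). By linearity of the 
expectation and concavity of the logarithmic function, we have 

$$\mathbb{E}\left[\frac{1}{t} \log \mathbb{P}\left(X_{i,t} \in D \mid W_1, \ldots, W_t\right)\right] \leq \frac{1}{t} \log \mathbb{E}\left[\mathbb{P}\left(X_{i,t} \in D \mid W_1, \ldots, W_t\right)\right] = \frac{1}{t} \log \mathbb{P}\left(X_{i,t} \in D\right).$$

Taking the lim inf as $t \to +\infty$ on both sides of the preceding inequality and combining the result 
with (57) yields: 

$$\liminf_{t \to +\infty} \frac{1}{t} \log \mathbb{P}\left(X_{i,t} \in D\right) \geq \mathbb{E}\left[R^\star\right]. \tag{58}$$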

We now focus on the random variable $R_t$. Note that we assumed that $D^\circ$ is nonempty (if the interior 
of $D$ is empty, the lower bound (25) holds trivially). Since $D^\circ$ is open, for any $x \in D^\circ$, we can find a 
small neighborhood $B_x(\delta_0)$ that is fully contained in $D^\circ$ (where $\delta_0 = \delta_0(x)$). Hence, for all $\delta \leq \delta_0$, we 
have $B_x(\delta) \subseteq D^\circ \subseteq D$, and thus, for any fixed $\omega \in \Omega$ 

$$R_t \geq \frac{1}{t} \log \mathbb{P}\left(X_{i,t} \in B_x(\delta) \mid W_1, \ldots, W_t\right) \tag{59}$$

(we used here that the logarithmic function is non-decreasing). Since (59) holds for all $t$ and all $\delta$ 
sufficiently small, taking the corresponding limits yields 

$$R^\star \geq \lim_{\delta \to 0} \liminf_{t \to +\infty} \frac{1}{t} \log \mathbb{P}\left(X_{i,t} \in B_x(\delta) \mid W_1, \ldots, W_t\right).$$

Using now the assumption (34) of the lemma to bound the right-hand side of the preceding inequality, 
we obtain $R^\star \geq -N I(x)$, which, we note, holds for every point $x$ in $D^\circ$. Taking the supremum over all 
$x \in D^\circ$, we obtain that for every $\omega \in \Omega$, 

$$R^\star \geq -\inf_{x \in D^\circ} N I(x). \tag{60}$$

Taking the expectation in the left-hand side, and combining with (58), we finally obtain the lower 
bound (25): 

$$\liminf_{t \to +\infty} \frac{1}{t} \log \mathbb{P}\left(X_{i,t} \in D\right) \geq -\inf_{x \in D^\circ} N I(x).$$

Since $D$ was arbitrary, the claim of Lemma 18 is proven. 
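
The Jensen step used between (57) and (58), namely that the expected logarithm of the conditional probability is at most the logarithm of the unconditional probability, can be seen on a toy example. The sketch below assumes two equally likely topology realizations with made-up conditional probabilities; the numbers and the function name are hypothetical and serve only as an illustration.

```python
import math


def jensen_sides(weights, cond_probs, t):
    """Return (E[(1/t) log P(.|W)], (1/t) log E[P(.|W)]) for a discrete W."""
    lhs = sum(w * math.log(p) for w, p in zip(weights, cond_probs)) / t
    rhs = math.log(sum(w * p for w, p in zip(weights, cond_probs))) / t
    return lhs, rhs


# Two hypothetical realizations of the weight-matrix sequence, each with
# probability 1/2, and the conditional probability of the event under each.
lhs, rhs = jensen_sides([0.5, 0.5], [1e-2, 1e-6], t=10)
```

The gap between the two sides is large here because the unconditional probability is dominated by the more favorable realization.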


Appendix D 
Proof of Lemma 19 


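Being the sum of $\Lambda_t$ and a (convex) quadratic function, $\Lambda_{t,M}$ inherits convexity and differentiability 
from $\Lambda_t$ (in fact, $\Lambda_{t,M}$ is strictly convex due to strict convexity of $\|\lambda\|^2/(2M)$). To prove 1-coercivity, by 
convexity of $\Lambda_t$, we have that $\Lambda_t(\lambda) \geq \lambda^\top \theta$. Hence, 

$$\Lambda_{t,M}(\lambda) \geq \lambda^\top \theta + \frac{\|\lambda\|^2}{2M}.$$

Dividing both sides by $\|\lambda\|$ and using in the right-hand side that $\lambda^\top \theta \geq -\|\lambda\| \|\theta\|$, we obtain 

$$\frac{\Lambda_{t,M}(\lambda)}{\|\lambda\|} \geq -\|\theta\| + \frac{\|\lambda\|}{2M} \to +\infty$$

when $\|\lambda\| \to +\infty$, proving that $\Lambda_{t,M}$ is 1-coercive. Strict convexity, differentiability, and 1-coercivity of 
$\Lambda_{t,M}$ imply that the gradient map $\nabla \Lambda_{t,M}$ is a bijection, see, e.g., Corollary 4.1.3 in [33], p. 239. This 
proves part 1. 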
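
We now prove part 2. Fix $x$ and fix $t \geq 1$. Note that $\eta_t$ is the maximizer in $I_{t,M}(x) = \sup_{\lambda \in \mathbb{R}^d} \lambda^\top x - 
\Lambda_{t,M}(\lambda)$, and thus it holds that $I_{t,M}(x) = \eta_t^\top x - \Lambda_{t,M}(\eta_t)$. Since $\Lambda_t$ is convex (and differentiable), its 
gradient map is monotone. Hence, 

$$\left(\nabla \Lambda_t(\eta_t) - \nabla \Lambda_t(0)\right)^\top (\eta_t - 0) \geq 0. \tag{61}$$

We next show that the value of the gradient of $\Lambda_t$ at 0 equals $\theta$. From (36), we have 

$$\nabla \Lambda_t(\lambda) = \frac{1}{t} \sum_{s=1}^{t} \sum_{j=1}^{N} [\Phi(t,s)]_{ij} \nabla \Lambda\left([\Phi(t,s)]_{ij} \lambda\right). \tag{62}$$

The gradient of $\Lambda$ at $\lambda = 0$ equals $\theta$, see Lemma 1. Using the fact that, for each fixed $s$, $\sum_{j=1}^{N} [\Phi(t,s)]_{ij} = 1$, 
we obtain that $\nabla \Lambda_t(0) = \theta$. Thus, from (61) we have 

$$\left(\nabla \Lambda_t(\eta_t) - \theta\right)^\top \eta_t \geq 0. \tag{63}$$

Now, note from (36) that $\nabla \Lambda_t(\lambda) = \nabla \Lambda_{t,M}(\lambda) - \lambda/M$, for arbitrary $\lambda$. Using now the fact that $\nabla \Lambda_{t,M}(\eta_t) = 
x$, (63) implies $\left(x - \frac{1}{M}\eta_t - \theta\right)^\top \eta_t \geq 0$. Thus, $(x - \theta)^\top \eta_t \geq \frac{1}{M}\|\eta_t\|^2$, which, by the Cauchy-Schwarz inequality, 
gives $\|\eta_t\| \leq M \|x - \theta\|$, proving the claim of the lemma 
for this fixed $t$ and $x$. Since these were arbitrary, the proof of the lemma is complete. 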
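
The bound of part 2 can be sanity-checked in a case where the gradient equation is solvable by hand. The sketch below is an illustration only: it assumes a scalar Gaussian model with $\Lambda_t(\lambda) = \theta\lambda + \sigma^2\lambda^2/2$ (for instance, a single node), so that $\Lambda_{t,M}(\lambda) = \Lambda_t(\lambda) + \lambda^2/(2M)$ has an affine gradient. The parameter values and function names are hypothetical.

```python
def tilt_point(x, theta, sigma2, M):
    """Solve theta + (sigma2 + 1/M) * eta = x, i.e. grad Lambda_{t,M}(eta) = x."""
    return (x - theta) / (sigma2 + 1.0 / M)


def check_bounds(x, theta=1.0, sigma2=4.0, M=10.0):
    """Return eta and whether the two inequalities from part 2 hold at x."""
    eta = tilt_point(x, theta, sigma2, M)
    # (x - theta) * eta >= eta^2 / M, the inequality obtained from (63).
    inner_ok = (x - theta) * eta >= eta * eta / M - 1e-12
    # |eta| <= M * |x - theta|, the resulting norm bound.
    norm_ok = abs(eta) <= M * abs(x - theta) + 1e-12
    return eta, inner_ok, norm_ok


eta, inner_ok, norm_ok = check_bounds(3.0)
```

In this example the norm bound is far from tight, since the variance term already controls the size of the maximizer.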


Appendix E 
Proof of Lemma 20 


From the fact that $\mathcal{D}_\Lambda = \mathbb{R}^d$, one can show that $I_{t,M}$ has compact level sets (note that $I_{t,M}$ is 
lower semicontinuous). Thus, the infimum in (44) has a solution. Denote this solution by $w_t$ and let $\zeta_t$ 
denote a point for which $w_t = \nabla \Lambda_{t,M}(\zeta_t + \eta_t)$ (such a point exists by Lemma 19). We 
now show that $\|w_t\|$ is uniformly bounded for all $t$, which, combined with part 2 of Lemma 19, in turn 
implies that $\eta_t + \zeta_t$ is uniformly bounded. 

Lemma 21. For any fixed $\delta > 0$ and $M > 0$, there exists $R = R(x, \delta, M) < +\infty$ such that for all $t$: 

1) $\|w_t\| \leq R$, and 

2) $\|\zeta_t + \eta_t\| \leq M(R + \|\theta\|)$. 

Proof: Fix $M > 0$, $\delta > 0$. Define $\overline{I}_M, \underline{I}_M : \mathbb{R}^d \to \mathbb{R}$ as 

$$\overline{I}_M(z) = \sup_{\lambda \in \mathbb{R}^d} \lambda^\top z - N\Lambda\left(\tfrac{1}{N}\lambda\right) - \frac{\|\lambda\|^2}{2M},$$

$$\underline{I}_M(z) = \sup_{\lambda \in \mathbb{R}^d} \lambda^\top z - \Lambda(\lambda) - \frac{\|\lambda\|^2}{2M},$$

for $z \in \mathbb{R}^d$. Note that both $\overline{I}_M$ and $\underline{I}_M$ are lower semicontinuous, finite for every $z$, and have compact level 
sets. Let $c = \inf_{z \in B_x^c(\delta)} \overline{I}_M(z) < +\infty$, and define $S_c = \{z \in \mathbb{R}^d : \underline{I}_M(z) \leq c\}$. 

Fix arbitrary $t \geq 1$. One can show, with the help of an earlier lemma, that, for any $z \in \mathbb{R}^d$, 

$$\underline{I}_M(z) \leq I_{t,M}(z) \leq \overline{I}_M(z). \tag{64}$$

Observe now that $I_{t,M}(w_t) = \inf_{z \in B_x^c(\delta)} I_{t,M}(z) \leq \inf_{z \in B_x^c(\delta)} \overline{I}_M(z) = c$. On the other hand, taking 
$z = w_t$ in (64) yields $\underline{I}_M(w_t) \leq I_{t,M}(w_t)$, and it thus follows that $w_t$ belongs to $S_c$. 

Finally, as $S_c$ is compact, we can find a ball of some radius $R = R(x, M, \delta) > 0$ that covers $S_c$, 
implying $w_t \in B_0(R)$. Since $t$ was arbitrary, the claim in part 1 follows. 

We now prove part 2. Recall that, for any $t$, $w_t$ and $\zeta_t + \eta_t$ satisfy $w_t = \nabla \Lambda_{t,M}(\zeta_t + \eta_t)$. Applying 
part 2 of Lemma 19 for $z = w_t$, we have that $\|\zeta_t + \eta_t\| \leq M \|w_t - \theta\|$. Combining this with part 1 of 
this lemma, 

$$\|\zeta_t + \eta_t\| \leq M \|w_t - \theta\| \leq M \sup_{w \in B_0(R)} \|w - \theta\| \leq M(R + \|\theta\|).$$

This completes the proof of part 2 and the proof of Lemma 21. ■ 
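
The sandwich (64) is what one obtains by conjugation from the pointwise bounds $N\Lambda(\lambda/N) \leq \sum_{j} \Lambda(a_j \lambda) \leq \Lambda(\lambda)$, valid for any convex $\Lambda$ with $\Lambda(0) = 0$ and any stochastic row $a$. The sketch below checks these bounds numerically. It assumes, consistently with the gradient formula (62), that $\Lambda_t$ is an average of such sums over rows of $\Phi(t,s)$; the two scalar log-moment generating functions, the random rows, and all names are hypothetical choices for illustration.

```python
import math
import random


def gaussian_log_mgf(nu, theta=1.0, sigma2=4.0):
    """Log-MGF of N(theta, sigma2)."""
    return theta * nu + 0.5 * sigma2 * nu * nu


def bernoulli_log_mgf(nu, p=0.3):
    """Log-MGF of a Bernoulli(p) variable (a non-Gaussian example)."""
    return math.log(1.0 - p + p * math.exp(nu))


def random_stochastic_row(n, rng):
    """Nonnegative weights summing to one, standing in for a row of Phi(t, s)."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


def sandwich_holds(log_mgf, n=5, trials=1000, seed=0, tol=1e-9):
    """Check n*L(nu/n) <= sum_j L(a_j*nu) <= L(nu) on random rows and nu."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = random_stochastic_row(n, rng)
        nu = rng.uniform(-3.0, 3.0)
        middle = sum(log_mgf(aj * nu) for aj in a)
        if not (n * log_mgf(nu / n) - tol <= middle <= log_mgf(nu) + tol):
            return False
    return True
```

The lower bound is Jensen's inequality applied to the uniform average over $j$, and the upper bound uses $\Lambda(a_j\lambda) \leq a_j\Lambda(\lambda)$ for $a_j \in [0,1]$.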

Fix $x$, $\delta$ and $M$ and define $r_1 = M \|x - \theta\|$, $r_2 = M(R + \|\theta\|)$, where $R$ is the constant that verifies 
Lemma 21. Fix now $t \geq 1$ and recall that $\eta_t$, $\zeta_t$, and $w_t$ are chosen such that $x = \nabla \Lambda_{t,M}(\eta_t)$, $I_{t,M}(w_t) = 
\inf_{z \in B_x^c(\delta)} I_{t,M}(z)$, and $w_t = \nabla \Lambda_{t,M}(\eta_t + \zeta_t)$. By part 2 of Lemma 19 and part 2 of Lemma 21 we have 
for $\eta_t$ and $\zeta_t$: $\|\eta_t\| \leq r_1$, $\|\eta_t + \zeta_t\| \leq r_2$. To prove Lemma 20 we first show that there exists some 
positive constant $r_3$, independent of $t$, such that $\|\zeta_t\| \geq r_3$ for all $t$. To this end, consider the gradient 
map $\lambda \mapsto \nabla \Lambda_{t,M}(\lambda)$, and note that $\nabla \Lambda_{t,M}$ is continuous, and hence uniformly continuous on every 
compact set. Note also that $\|\eta_t\|, \|\eta_t + \zeta_t\| \leq \max\{r_1, r_2\}$; that is, points $\eta_t$ and $\eta_t + \zeta_t$ are uniformly 
bounded for all $t$. Suppose now, for the sake of contradiction, that for some sequence of times $t_k$, 
$k = 1, 2, \ldots$, $\|\zeta_{t_k}\| \to 0$ as $k \to +\infty$. Then $\|(\eta_{t_k} + \zeta_{t_k}) - \eta_{t_k}\| \to 0$, and hence, by the uniform 
continuity of $\nabla \Lambda_{t,M}(\cdot)$ on $B_0(\max\{r_1, r_2\})$, we have 

$$\left\|\nabla \Lambda_{t_k,M}(\eta_{t_k}) - \nabla \Lambda_{t_k,M}(\eta_{t_k} + \zeta_{t_k})\right\| \to 0 \quad \text{as } k \to \infty.$$

Recalling that $x = \nabla \Lambda_{t_k,M}(\eta_{t_k})$ and $w_{t_k} = \nabla \Lambda_{t_k,M}(\eta_{t_k} + \zeta_{t_k})$ yields 

$$\|w_{t_k} - x\| \to 0.$$

This contradicts the fact that, for all $t$, $w_t \in B_x^c(\delta)$. Thus, we proved the existence of $r_3$ 
independent of $t$ such that $\|\zeta_t\| \geq r_3$ for all $t$. 

Now, let 

$$\Upsilon = \left\{(\eta, \zeta) \in \mathbb{R}^d \times \mathbb{R}^d : \|\eta\| \leq r_1, \ \|\eta + \zeta\| \leq r_2, \ \|\zeta\| \geq r_3\right\},$$

and introduce $g : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, 

$$g(\eta, \zeta) = \Lambda_{t,M}(\eta) - \Lambda_{t,M}(\zeta + \eta) + \nabla \Lambda_{t,M}(\zeta + \eta)^\top \zeta. \tag{65}$$

By strict convexity of $\Lambda_{t,M}$, we see that, for any $\eta$ and $\zeta \neq 0$, the value $g(\eta, \zeta)$ is strictly positive. 
Further, note that since $\Lambda_{t,M}$ and $\nabla \Lambda_{t,M}$ are continuous, function $g$ is also continuous. Consider now 

$$\xi := \inf_{(\eta, \zeta) \in \Upsilon} g(\eta, \zeta). \tag{66}$$

Because $\Upsilon$ is compact, by the Weierstrass theorem, the problem in (66) has a solution, that is, there 
exists $(\eta_0, \zeta_0) \in \Upsilon$ such that $g(\eta_0, \zeta_0) = \xi$. Finally, because $g$ is strictly positive at each point in $\Upsilon$ (note 
that $\zeta \neq 0$ in $\Upsilon$), we conclude that $\xi = g(\eta_0, \zeta_0) > 0$. 

Returning to the claim of Lemma 20, by Lemma 21, $(\eta_t, \zeta_t)$ belongs to $\Upsilon$, and, thus, 

$$\Lambda_{t,M}(\eta_t) - \Lambda_{t,M}(\zeta_t + \eta_t) + \nabla \Lambda_{t,M}(\zeta_t + \eta_t)^\top \zeta_t = g(\eta_t, \zeta_t) \geq \xi.$$

This completes the proof of Lemma 20. 


References 

[1] I. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, “A survey on sensor networks,” Communications Magazine, 
IEEE, vol. 40, no. 8, pp. 102-114, Aug. 2002. 

[2] J. Abbott, Z. Nagy, F. Beyeler, and B. Nelson, “Robotics in the small, part I: Microbotics,” Robotics Automation Magazine, 
IEEE, vol. 14, no. 2, pp. 92-103, June 2007. 

[3] I. F. Akyildiz and J. M. Jornet, “Electromagnetic wireless nanosensor networks,” Nano Communication Networks, vol. 1, 
no. 1, pp. 3 - 19, 2010. 

[4] S. Barbarossa, S. Sardellitti, and P. D. Lorenzo, “Distributed detection and estimation in wireless sensor networks,” SIAM 
Journal of Control and Optimization, July 2013, http://arxiv.org/abs/1307.1448. 

[5] N. E. Leonard and A. Olshevsky, “Cooperative learning in multiagent systems from intermittent measurements,” SIAM 
Journal of Control and Optimization, vol. 53, no. 1, pp. 1-29, 2015. 

[6] M. Çetin, L. Chen, J. W. Fisher III, A. T. Ihler, R. L. Moses, M. J. Wainwright, and A. S. Willsky, “Distributed fusion in 
sensor networks - a graphical models perspective,” IEEE Signal Processing Magazine, vol. 23, pp. 42-55, 2006. 

[7] S. Stankovic, M. S. Stankovic, and D. M. Stipanovic, “Consensus based overlapping decentralized estimator,” IEEE Trans. 
Automatic Control, vol. 54, no. 2, pp. 410-415, February 2009. 

[8] P. Braca, S. Marano, and V. Matta, “Enforcing consensus while monitoring the environment in wireless sensor networks,” 
IEEE Transactions on Signal Processing, vol. 56, no. 7, pp. 3375-3380, July 2008. 



[9] D. Bajovic, D. Jakovetic, J. Xavier, B. Sinopoli, and J. M. F. Moura, “Distributed detection via Gaussian running consensus: 
Large deviations asymptotic analysis,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4381-4396, Sep. 2011. 

[10] S. Kar, J. M. F. Moura, and K. Ramanan, “Distributed parameter estimation in sensor networks: Nonlinear observation 
models and imperfect communication,” IEEE Transactions on Information Theory, vol. 58, no. 6, pp. 3575-3605, June 
2012. 

[11] R. Olfati-Saber, “Distributed Kalman filter with embedded consensus filters,” in Decision and Control 2005 and 2005 
European Control Conference. CDC-ECC ’05. 44th IEEE Conference on, Dec 2005, pp. 8179-8184. 

[12] R. Carli, A. Chiuso, L. Schenato, and S. Zampieri, “Distributed Kalman filtering based on consensus strategies,” IEEE 
Journal on Selected Areas in Communications, vol. 26, no. 4, pp. 622-633, May 2008. 

[13] D. Bajovic, D. Jakovetic, J. M. F. Moura, J. Xavier, and B. Sinopoli, “Large deviations performance of 
consensus+innovations distributed detection with non-Gaussian observations,” IEEE Transactions on Signal Processing, vol. 60, 
no. 11, pp. 5987-6002, Nov. 2012. 

[14] S. S. Stankovic, N. Ilic, M. S. Stankovic, and K. H. Johansson, “Distributed change detection based on a consensus 
algorithm,” IEEE Transactions on Signal Processing, vol. 59, no. 12, pp. 5686-5697, Dec. 2011. 

[15] G. Mateos, I. D. Schizas, and G. B. Giannakis, “Performance analysis of the consensus-based distributed LMS algorithm,” 
EURASIP J. Adv. Signal Process, vol. 68, Jan. 2009. 

[16] P. Di Lorenzo and A. Sayed, “Sparse distributed learning based on diffusion adaptation,” IEEE Transactions on Signal 
Processing, vol. 61, no. 6, pp. 1419-1433, March 2013. 

[17] R. Rahman, M. Alanyali, and V. Saligrama, “Distributed tracking in multihop sensor networks with communication delays,” 
IEEE Transactions on Signal Processing, vol. 55, no. 9, Sep. 2007. 

[18] S. Kar and J. M. F. Moura, “Asymptotically efficient distributed estimation with exponential family statistics,” IEEE 
Transactions on Information Theory, vol. 60, no. 8, pp. 4811-4831, Aug. 2014. 

[19] D. Li, S. Kar, J. M. F. Moura, H. V. Poor, and S. Cui, “Distributed Kalman filtering over massive data sets: Analysis through 
large deviations of random Riccati equations,” IEEE Transactions on Information Theory, vol. 61, no. 3, pp. 1351-1372, 
March 2015. 

[20] P. Braca, S. Marano, V. Matta, and A. H. Sayed, “Asymptotic performance of adaptive distributed detection over networks,” 
Jan. 2014. http://arxiv.org/abs/1401.5742. 

[21] A. Lalitha, A. Sarwate, and T. Javidi, “Social learning and distributed hypothesis testing,” in Information Theory (ISIT), 
2014 IEEE International Symposium on, June 2014, pp. 551-555. 

[22] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: John Wiley and Sons, 1991. 

[23] H. Chernoff, “A measure of the asymptotic efficiency of tests of a hypothesis based on a sum of observations,” The Annals 
of Mathematical Statistics, vol. 23, no. 4, pp. 493-507, Dec. 1952. 

[24] M. Arcones, “Large deviations for M-estimators,” Annals of the Institute of Statistical Mathematics, vol. 58, no. 1, pp. 
21-52, 2006. 

[25] H. Cramér, “Sur un nouveau théorème-limite de la théorie des probabilités,” Actualités Scientifiques et Industrielles, vol. 
736, pp. 5-23, 1938, Paris. 

[26] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications. Boston, MA: Jones and Bartlett, 1993. 

[27] F. den Hollander, Large Deviations. Fields Institute Monographs, American Mathematical Society, 2000. 

[28] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, United Kingdom: Cambridge University Press, 2004. 



[29] A. Tahbaz-Salehi and A. Jadbabaie, “A necessary and sufficient condition for consensus over random networks,” Automatic 
Control, IEEE Transactions on, vol. 53, no. 3, pp. 791-795, April 2008. 

[30] E. Seneta, Nonnegative Matrices and Markov Chains. New York: Springer, 1981. 

[31] S. Kirkland, “Subdominant eigenvalues for stochastic matrices with given column sums,” Electronic Journal of Linear 
Algebra, vol. 18, 2009. 

[32] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge, United Kingdom: Cambridge University Press, 1990. 

[33] J.-B. Hiriart-Urruty and C. Lemaréchal, Fundamentals of Convex Analysis, ser. Grundlehren Text Editions. Berlin, 
Germany: Springer-Verlag, 2004. 

[34] A. F. Karr, Probability. New York: Springer-Verlag, 1993. 

[35] I. Lobel and A. Ozdaglar, “Distributed subgradient methods for convex optimization over random networks,” Automatic 
Control, IEEE Transactions on, vol. 56, no. 6, pp. 1291-1306, June 2011. 

[36] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming, version 2.1.” 

[37] M. Grant and S. Boyd, “Graph implementations for nonsmooth convex programs,” in Recent Advances in Learning and Control, ser. Lecture 
Notes in Control and Information Sciences, V. Blondel, S. Boyd, and H. Kimura, Eds. Springer-Verlag Limited, 2008, 
pp. 95-110. 

[38] L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,” Systems and Control Letters, vol. 53, pp. 65-78, 
2003. 





